From python at mrabarnett.plus.com Mon Aug 1 00:52:04 2011 From: python at mrabarnett.plus.com (MRAB) Date: Sun, 31 Jul 2011 23:52:04 +0100 Subject: [Python-Dev] urllib bug in Python 3.2.1? Message-ID: <4E35DC94.2090208@mrabarnett.plus.com> Someone over at StackOverflow has a problem with urlopen in Python 3.2.1: http://stackoverflow.com/questions/6892573/problem-with-urlopen/6892843#6892843 This is the code: from urllib.request import urlopen f = urlopen('http://online.wsj.com/mdc/public/page/2_3020-tips.html?mod=topnav_2_3000') page = f.read() f.close() With Python 3.1 and Python 3.2 it works OK, but with Python 3.2.1 the read returns an empty string (I checked it myself). From nad at acm.org Mon Aug 1 01:06:48 2011 From: nad at acm.org (Ned Deily) Date: Sun, 31 Jul 2011 16:06:48 -0700 Subject: [Python-Dev] urllib bug in Python 3.2.1? References: <4E35DC94.2090208@mrabarnett.plus.com> Message-ID: In article <4E35DC94.2090208 at mrabarnett.plus.com>, MRAB wrote: > Someone over at StackOverflow has a problem with urlopen in Python 3.2.1: > > > http://stackoverflow.com/questions/6892573/problem-with-urlopen/6892843#689284 > 3 > > This is the code: > > from urllib.request import urlopen > f = > urlopen('http://online.wsj.com/mdc/public/page/2_3020-tips.html?mod=topnav_2_3 > 000') > page = f.read() > f.close() > > With Python 3.1 and Python 3.2 it works OK, but with Python 3.2.1 the > read returns an empty string (I checked it myself). http://bugs.python.org/issue12576 -- Ned Deily, nad at acm.org From rdmurray at bitdance.com Tue Aug 2 04:22:20 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 01 Aug 2011 22:22:20 -0400 Subject: [Python-Dev] =?utf8?q?=5BPython-checkins=5D_cpython_=283=2E2=29?= =?utf8?q?=3A_Skip_test=5Fgetsetlocale=5Fissue1813=28=29_on_Fedora_?= =?utf8?q?due_to_setlocale=28=29_bug=2E?= In-Reply-To: References: Message-ID: <20110802022221.9FE582506C6@webabinitio.net> On Tue, 02 Aug 2011 01:22:03 +0200, stefan.krah wrote: > http://hg.python.org/cpython/rev/68b5f87566fb > changeset: 71683:68b5f87566fb > branch: 3.2 > parent: 71679:1f9ca1819d7c > user: Stefan Krah > date: Tue Aug 02 01:06:16 2011 +0200 > summary: > Skip test_getsetlocale_issue1813() on Fedora due to setlocale() bug. > See: https://bugzilla.redhat.com/show_bug.cgi?id=726536 > > files: > Lib/test/test_locale.py | 3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > > diff --git a/Lib/test/test_locale.py b/Lib/test/test_locale.py > --- a/Lib/test/test_locale.py > +++ b/Lib/test/test_locale.py > @@ -1,4 +1,5 @@ > from test.support import run_unittest, verbose > +from platform import linux_distribution > import unittest > import locale > import sys > @@ -391,6 +392,8 @@ > # crasher from bug #7419 > self.assertRaises(locale.Error, locale.setlocale, 12345) > > + @unittest.skipIf(linux_distribution()[0] == 'Fedora', "Fedora setlocale() " > + "bug: https://bugzilla.redhat.com/show_bug.cgi?id=726536") > def test_getsetlocale_issue1813(self): > # Issue #1813: setting and getting the locale under a Turkish locale > oldlocale = locale.setlocale(locale.LC_CTYPE) Why 'Fedora'? This bug affects more than just Fedora: as I reported on the issue, I'm seeing it on Gentoo as well. (Also, including the issue number in the commit message is helpful). Note that since the bug report says that "Gentoo has been including this fix for two years", the fact that it is failing on my Gentoo system would seem to indicate that something about the fix is not right. So, I'm not sure this skip is even valid.
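As an aside (a sketch only, not the committed change): the skip could key off the observed setlocale() behaviour instead of the distribution name, which would sidestep the Fedora-vs-Gentoo question. Something like the following, assuming the tr_TR locales are installed; note it deliberately lumps "locales not installed" together with "glibc is buggy":

    import locale

    def _turkish_setlocale_broken():
        old = locale.setlocale(locale.LC_CTYPE)
        try:
            locale.setlocale(locale.LC_CTYPE, 'tr_TR')
            locale.setlocale(locale.LC_CTYPE, 'tr_TR.ISO8859-9')
        except locale.Error:
            return True     # missing locales, or the buggy setlocale() behaviour
        finally:
            locale.setlocale(locale.LC_CTYPE, old)
        return False

    # @unittest.skipIf(_turkish_setlocale_broken(),
    #                  "setlocale() cannot switch between tr_TR encodings")
    # def test_getsetlocale_issue1813(self): ...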
I'm not sure we've finished diagnosing the bug. If there are any helpful tests I can run on Gentoo, please let me know. -- R. David Murray http://www.bitdance.com From nadeem.vawda at gmail.com Tue Aug 2 10:17:52 2011 From: nadeem.vawda at gmail.com (Nadeem Vawda) Date: Tue, 2 Aug 2011 10:17:52 +0200 Subject: [Python-Dev] [Python-checkins] cpython: Issue #11651: Move options for running tests into a Python script. In-Reply-To: <4E372CB3.60408@udel.edu> References: <4E372CB3.60408@udel.edu> Message-ID: Thanks for catching that. Fixed in 0b52b6f1bfab. Nadeem From stefan at bytereef.org Tue Aug 2 10:48:07 2011 From: stefan at bytereef.org (Stefan Krah) Date: Tue, 2 Aug 2011 10:48:07 +0200 Subject: [Python-Dev] [Python-checkins] cpython (3.2): Skip test_getsetlocale_issue1813() on Fedora due to setlocale() bug. In-Reply-To: <20110802022221.9FE582506C6@webabinitio.net> References: <20110802022221.9FE582506C6@webabinitio.net> Message-ID: <20110802084807.GA1830@sleipnir.bytereef.org> R. David Murray wrote: > On Tue, 02 Aug 2011 01:22:03 +0200, stefan.krah wrote: > > Skip test_getsetlocale_issue1813() on Fedora due to setlocale() bug. > > See: https://bugzilla.redhat.com/show_bug.cgi?id=726536 > > + @unittest.skipIf(linux_distribution()[0] == 'Fedora', "Fedora setlocale() " > > + "bug: https://bugzilla.redhat.com/show_bug.cgi?id=726536") > > Why 'Fedora'? This bug affects more than just Fedora: as I reported on > the issue, I'm seeing it on Gentoo as well. (Also, including the issue > number in the commit message is helpful). > > Note that since the bug report says that "Gentoo has been including this > fix for two years", the fact that it is failing on my Gentoo system > would seem to indicate that something about the fix is not right. > > So, I'm not sure this skip is even valid. I'm not sure we've finished > diagnosing the bug. Fedora's glibc has an additional issue with the Turkish 'I' that can be reproduced by the simple C program in: https://bugzilla.redhat.com/show_bug.cgi?id=726536 I disabled the test specifically on Fedora because it a) seems to be the only Linux buildbot where this test fails and b) this does not seem like a Python issue to me. Since you say that the fix for issue #1813 might not be right, do you think that the fix should work around this glibc issue? > If there are any helpful tests I can run on Gentoo, please let me know. Yes, you could run the small test program. If you get the same results as on Fedora, then I wonder why the Gentoo buildbots are green. Do they have tr_TR and tr_TR.iso88599 installed? Stefan Krah From ronaldoussoren at mac.com Tue Aug 2 10:40:20 2011 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Tue, 02 Aug 2011 10:40:20 +0200 Subject: [Python-Dev] [Python-checkins] cpython (3.2): Skip test_getsetlocale_issue1813() on Fedora due to setlocale() bug. In-Reply-To: <20110802022221.9FE582506C6@webabinitio.net> References: <20110802022221.9FE582506C6@webabinitio.net> Message-ID: On 2 Aug, 2011, at 4:22, R. David Murray wrote: > On Tue, 02 Aug 2011 01:22:03 +0200, stefan.krah wrote: >> http://hg.python.org/cpython/rev/68b5f87566fb >> changeset: 71683:68b5f87566fb >> branch: 3.2 >> parent: 71679:1f9ca1819d7c >> user: Stefan Krah >> date: Tue Aug 02 01:06:16 2011 +0200 >> summary: >> Skip test_getsetlocale_issue1813() on Fedora due to setlocale() bug. 
>> See: https://bugzilla.redhat.com/show_bug.cgi?id=726536 >> >> files: >> Lib/test/test_locale.py | 3 +++ >> 1 files changed, 3 insertions(+), 0 deletions(-) >> >> >> diff --git a/Lib/test/test_locale.py b/Lib/test/test_locale.py >> --- a/Lib/test/test_locale.py >> +++ b/Lib/test/test_locale.py >> @@ -1,4 +1,5 @@ >> from test.support import run_unittest, verbose >> +from platform import linux_distribution >> import unittest >> import locale >> import sys >> @@ -391,6 +392,8 @@ >> # crasher from bug #7419 >> self.assertRaises(locale.Error, locale.setlocale, 12345) >> >> + @unittest.skipIf(linux_distribution()[0] == 'Fedora', "Fedora setlocale() " >> + "bug: https://bugzilla.redhat.com/show_bug.cgi?id=726536") >> def test_getsetlocale_issue1813(self): >> # Issue #1813: setting and getting the locale under a Turkish locale >> oldlocale = locale.setlocale(locale.LC_CTYPE) > > Why 'Fedora'? This bug affects more than just Fedora: as I reported on > the issue, I'm seeing it on Gentoo as well. (Also, including the issue > number in the commit message is helpful). > > Note that since the bug report says that "Gentoo has been including this > fix for two years", the fact that it is failing on my Gentoo system > would seem to indicate that something about the fix is not right. > > So, I'm not sure this skip is even valid. I'm not sure we've finished > diagnosing the bug. > > If there are any helpful tests I can run on Gentoo, please let me know. Wouldn't it be better to mark this as an expected failure on the affected platforms? Skipping the test unconditionally will skip the test even when Fedora gets around to fixing this issue. Ronald > > -- > R. David Murray http://www.bitdance.com > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/ronaldoussoren%40mac.com -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2224 bytes Desc: not available URL: From stefan at bytereef.org Tue Aug 2 12:12:37 2011 From: stefan at bytereef.org (Stefan Krah) Date: Tue, 2 Aug 2011 12:12:37 +0200 Subject: [Python-Dev] [Python-checkins] cpython (3.2): Skip test_getsetlocale_issue1813() on Fedora due to setlocale() bug. In-Reply-To: <20110802084807.GA1830@sleipnir.bytereef.org> References: <20110802022221.9FE582506C6@webabinitio.net> <20110802084807.GA1830@sleipnir.bytereef.org> Message-ID: <20110802101237.GA12366@sleipnir.bytereef.org> Stefan Krah wrote: > Fedora's glibc has an additional issue with the Turkish 'I' that can > be reproduced by the simple C program in: > > https://bugzilla.redhat.com/show_bug.cgi?id=726536 OK, this runs successfully on Ubuntu Lucid and FreeBSD (if you change the first tr_TR to tr_TR.UTF-8). But it fails on Debian lenny, as does test_getsetlocale_issue1813(). I suspect many buildbots are green because they don't have tr_TR and tr_TR.iso8859-9 installed. Synopsis for the people who don't want to wade through the bug reports: If this is a valid C program ... #include #include int main(void) { char *s; printf("%s\n", setlocale(LC_CTYPE, "tr_TR")); printf("%s\n", setlocale(LC_CTYPE, NULL)); s = setlocale(LC_CTYPE, "tr_TR.ISO8859-9"); printf("%s\n", s ? s : "null"); return 0; } ..., several systems (Fedora 14, Debian lenny) have a glibc bug that is exposed by test_getsetlocale_issue1813(). 
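For anyone who would rather poke at this from Python than C, roughly the same sequence can be driven through the locale module (illustrative only; it assumes the tr_TR and tr_TR.ISO8859-9 locales are installed):

    import locale

    print(locale.setlocale(locale.LC_CTYPE, 'tr_TR'))
    print(locale.setlocale(locale.LC_CTYPE))           # query only, like setlocale(..., NULL)
    try:
        # On an affected glibc this call fails even though the first one succeeded.
        print(locale.setlocale(locale.LC_CTYPE, 'tr_TR.ISO8859-9'))
    except locale.Error as e:
        print('null (%s)' % e)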
People usually don't see this because tr_TR and tr_TR.iso8859-9 aren't installed. Stefan Krah From scott+python-dev at scottdial.com Tue Aug 2 12:20:47 2011 From: scott+python-dev at scottdial.com (Scott Dial) Date: Tue, 02 Aug 2011 06:20:47 -0400 Subject: [Python-Dev] [Python-checkins] cpython (3.2): Skip test_getsetlocale_issue1813() on Fedora due to setlocale() bug. In-Reply-To: <20110802084807.GA1830@sleipnir.bytereef.org> References: <20110802022221.9FE582506C6@webabinitio.net> <20110802084807.GA1830@sleipnir.bytereef.org> Message-ID: <4E37CF7F.2020900@scottdial.com> On 8/2/2011 4:48 AM, Stefan Krah wrote: > R. David Murray wrote: >> If there are any helpful tests I can run on Gentoo, please let me know. > > Yes, you could run the small test program. If you get the same results > as on Fedora, then I wonder why the Gentoo buildbots are green. > > Do they have tr_TR and tr_TR.iso88599 installed? Highly doubtful. It is a normal part of the Gentoo install process to select the locales that you want for the system. Even the example list of locales doesn't include any Turkish locales, so one would've had to gone to specific effort to add that one. -- Scott Dial scott at scottdial.com From rdmurray at bitdance.com Tue Aug 2 13:31:46 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Tue, 02 Aug 2011 07:31:46 -0400 Subject: [Python-Dev] [Python-checkins] cpython (3.2): Skip test_getsetlocale_issue1813() on Fedora due to setlocale() bug. In-Reply-To: <20110802101237.GA12366@sleipnir.bytereef.org> References: <20110802022221.9FE582506C6@webabinitio.net> <20110802084807.GA1830@sleipnir.bytereef.org> <20110802101237.GA12366@sleipnir.bytereef.org> Message-ID: <20110802113146.E6D312506C6@webabinitio.net> On Tue, 02 Aug 2011 12:12:37 +0200, Stefan Krah wrote: > Stefan Krah wrote: > > Fedora's glibc has an additional issue with the Turkish 'I' that can > > be reproduced by the simple C program in: > > > > https://bugzilla.redhat.com/show_bug.cgi?id=726536 > > OK, this runs successfully on Ubuntu Lucid and FreeBSD (if you change > the first tr_TR to tr_TR.UTF-8). > > But it fails on Debian lenny, as does test_getsetlocale_issue1813(). > > I suspect many buildbots are green because they don't have tr_TR and > tr_TR.iso8859-9 installed. This is true for my Gentoo buildbots. Once we've figured out the best way to handle this, I'll fix that (install the other locales) for my two. > Synopsis for the people who don't want to wade through the bug reports: > > If this is a valid C program ... > > #include > #include > int > main(void) > { > char *s; > printf("%s\n", setlocale(LC_CTYPE, "tr_TR")); > printf("%s\n", setlocale(LC_CTYPE, NULL)); > s = setlocale(LC_CTYPE, "tr_TR.ISO8859-9"); > printf("%s\n", s ? s : "null"); > return 0; > } > > ..., several systems (Fedora 14, Debian lenny) have a glibc bug that > is exposed by test_getsetlocale_issue1813(). People usually don't > see this because tr_TR and tr_TR.iso8859-9 aren't installed. I get null as the final output of that regardless of whether I use 'tr_TR' or 'tr_TR.utf8'. This is with glibc-2.13-r2 (the r2 is Gentoo's mod number). I'll attach this to the bug report, too, perhaps the discussion should move there. -- R. 
David Murray http://www.bitdance.com From solipsis at pitrou.net Tue Aug 2 14:16:01 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 2 Aug 2011 14:16:01 +0200 Subject: [Python-Dev] cpython (3.2): Fix closes Issue12676 - Invalid identifier used in TypeError message in References: Message-ID: <20110802141601.3a09784a@pitrou.net> On Tue, 02 Aug 2011 12:33:55 +0200 senthil.kumaran wrote: > raise TypeError("data should be a bytes-like object\ > - or an iterable, got %r " % type(it)) > + or an iterable, got %r " % type(data)) There are still a lot of spaces in your message. You should use string literal concatenation instead: raise TypeError( "data should be a bytes-like object " "or an iterable, got %r" % type(data)) From chris at simplistix.co.uk Tue Aug 2 19:48:11 2011 From: chris at simplistix.co.uk (Chris Withers) Date: Tue, 02 Aug 2011 18:48:11 +0100 Subject: [Python-Dev] email-6.0.0.a1 In-Reply-To: <20110719212139.D5D732500D5@webabinitio.net> References: <20110719212139.D5D732500D5@webabinitio.net> Message-ID: <4E38385B.4080201@simplistix.co.uk> On 19/07/2011 22:21, R. David Murray wrote: > The basic additional API is that a 'source' attribute contains the > text the generator read from the input source, and a 'value' attribute > that contains the value with all the Content-Transfer-Encoding stuff > undone so that you have a real unicode string. By changing a policy > setting, you can have that value as the string value of the header. > You can also assign a string with non-ASCII characters to a header, and > the right thing will happen. (Well, eventually it will happen...right > now it only works correctly for unstructured headers). Further, Date > headers have a datetime attribute (and accept being set to a datetime), > and address headers have attributes for accessing the individual addresses > in the header. Other structured headers will eventually grow additional > attributes as well. This all sounds pretty awesome, congrats :-) Has the header wrapping bug that was all part of the big headers mess been resolved now? cheers, Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From rdmurray at bitdance.com Tue Aug 2 23:27:06 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Tue, 02 Aug 2011 17:27:06 -0400 Subject: [Python-Dev] email-6.0.0.a1 In-Reply-To: <4E38385B.4080201@simplistix.co.uk> References: <20110719212139.D5D732500D5@webabinitio.net> <4E38385B.4080201@simplistix.co.uk> Message-ID: <20110802212706.E6099B14005@webabinitio.net> On Tue, 02 Aug 2011 18:48:11 +0100, Chris Withers wrote: > On 19/07/2011 22:21, R. David Murray wrote: > > The basic additional API is that a 'source' attribute contains the > > text the generator read from the input source, and a 'value' attribute > > that contains the value with all the Content-Transfer-Encoding stuff > > undone so that you have a real unicode string. By changing a policy > > setting, you can have that value as the string value of the header. > > You can also assign a string with non-ASCII characters to a header, and > > the right thing will happen. (Well, eventually it will happen...right > > now it only works correctly for unstructured headers). Further, Date > > headers have a datetime attribute (and accept being set to a datetime), > > and address headers have attributes for accessing the individual addresses > > in the header. Other structured headers will eventually grow additional > > attributes as well. 
> > This all sounds pretty awesome, congrats :-) > > Has the header wrapping bug that was all part of the big headers mess > been resolved now? If it is the bug I think you are talking about, it was resolved in 3.2.1. If there's still an open header wrapping bug (other than the one about smime and spaces after the ':') please let me know the issue number, as I don't see any in my list. There may still be an issue with whitespace padding in the encoded word context; I haven't tested issue 1467619 since I made my other changes. If it is not fixed in 3.2.1 already, it will be fixed in email6 by the time I finish the new wrapping code for that. --David From solipsis at pitrou.net Wed Aug 3 00:42:28 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 3 Aug 2011 00:42:28 +0200 Subject: [Python-Dev] cpython: NEWS note for bbeda42ea6a8 References: Message-ID: <20110803004228.5bde2dc7@pitrou.net> On Wed, 03 Aug 2011 00:30:41 +0200 benjamin.peterson wrote: > > diff --git a/Misc/NEWS b/Misc/NEWS > --- a/Misc/NEWS > +++ b/Misc/NEWS > @@ -10,6 +10,8 @@ > Core and Builtins > ----------------- > > +- Add ThreadError to threading.__all__. > + This should surely be in the library section. Regards Antoine. From eric at trueblade.com Wed Aug 3 05:06:42 2011 From: eric at trueblade.com (Eric Smith) Date: Tue, 02 Aug 2011 23:06:42 -0400 Subject: [Python-Dev] Fwd: [Python-checkins] devguide: Add Sandro to the list of core developers Message-ID: <4E38BB42.6060009@trueblade.com> Speaking of developers.rst, could whoever added Jason Coombs also update developers.rst? I've added Jason to the committers mailing list. Thanks. Eric. -------- Original Message -------- Subject: [Python-checkins] devguide: Add Sandro to the list of core developers Date: Tue, 02 Aug 2011 14:58:38 +0200 From: antoine.pitrou Reply-To: python-dev at python.org To: python-checkins at python.org http://hg.python.org/devguide/rev/2783106b0ccc changeset: 438:2783106b0ccc user: Antoine Pitrou date: Tue Aug 02 14:56:56 2011 +0200 summary: Add Sandro to the list of core developers files: developers.rst | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/developers.rst b/developers.rst --- a/developers.rst +++ b/developers.rst @@ -24,6 +24,10 @@ Permissions History ------------------- +- Sandro Tosi was given push privileges on Aug 1 2011 by Antoine Pitrou, + for documentation and other contributions, on recommendation by Ezio + Melotti, R. David Murray and others. + - Charles-Fran?ois Natali was given push privileges on May 19 2011 by Antoine Pitrou, for general contributions, on recommandation by Victor Stinner, Brian Curtin and others. -- Repository URL: http://hg.python.org/devguide From g.brandl at gmx.net Wed Aug 3 08:22:54 2011 From: g.brandl at gmx.net (Georg Brandl) Date: Wed, 03 Aug 2011 08:22:54 +0200 Subject: [Python-Dev] cpython: expose sched.h functions (closes #12655) In-Reply-To: References: Message-ID: Am 03.08.2011 00:30, schrieb benjamin.peterson: > http://hg.python.org/cpython/rev/89e92e684b37 > changeset: 71704:89e92e684b37 > user: Benjamin Peterson > date: Tue Aug 02 17:30:04 2011 -0500 > summary: > expose sched.h functions (closes #12655) > +static PyObject * > +posix_sched_setaffinity(PyObject *self, PyObject *args) > +{ > + pid_t pid; > + Py_cpu_set *cpu_set; > + > + if (!PyArg_ParseTuple(args, _Py_PARSE_PID "O!|sched_setaffinity", [...] 
> +static PyObject * > +posix_sched_getaffinity(PyObject *self, PyObject *args) > +{ > + pid_t pid; > + int ncpus; > + Py_cpu_set *res; > + > + if (!PyArg_ParseTuple(args, _Py_PARSE_PID "i|sched_getaffinity", These should be separated by ":", not "|", if I'm not mistaken? Georg From solipsis at pitrou.net Wed Aug 3 15:23:19 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 3 Aug 2011 15:23:19 +0200 Subject: [Python-Dev] cpython (3.2): Fix closes issue12683 - urljoin to work with relative join of svn scheme. References: Message-ID: <20110803152319.3844f598@pitrou.net> On Wed, 03 Aug 2011 12:47:23 +0200 senthil.kumaran wrote: > > diff --git a/Lib/test/test_urlparse.py b/Lib/test/test_urlparse.py > --- a/Lib/test/test_urlparse.py > +++ b/Lib/test/test_urlparse.py > @@ -371,6 +371,8 @@ > self.checkJoin('http:///', '..','http:///') > self.checkJoin('', 'http://a/b/c/g?y/./x','http://a/b/c/g?y/./x') > self.checkJoin('', 'http://a/./g', 'http://a/./g') > + self.checkJoin('svn://pathtorepo/dir1', 'dir2', 'svn://pathtorepo/dir2') > + self.checkJoin('svn://pathtorepo/dir1', 'dir2', 'svn://pathtorepo/dir2') This is the same test repeated. Perhaps you meant svn+ssh? Regards Antoine. From senthil at uthcode.com Wed Aug 3 15:56:57 2011 From: senthil at uthcode.com (Senthil Kumaran) Date: Wed, 3 Aug 2011 21:56:57 +0800 Subject: [Python-Dev] cpython (3.2): Fix closes Issue12676 - Invalid identifier used in TypeError message in In-Reply-To: <20110802141601.3a09784a@pitrou.net> References: <20110802141601.3a09784a@pitrou.net> Message-ID: <20110803135657.GA2477@mathmagic> On Tue, Aug 02, 2011 at 02:16:01PM +0200, Antoine Pitrou wrote: > There are still a lot of spaces in your message. You should use string Yes, did not realize that.. :( Georg fixed this in his commit. Thanks, Senthil From senthil at uthcode.com Wed Aug 3 15:57:35 2011 From: senthil at uthcode.com (Senthil Kumaran) Date: Wed, 3 Aug 2011 21:57:35 +0800 Subject: [Python-Dev] cpython (3.2): Fix closes issue12683 - urljoin to work with relative join of svn scheme. In-Reply-To: <20110803152319.3844f598@pitrou.net> References: <20110803152319.3844f598@pitrou.net> Message-ID: <20110803135735.GB2477@mathmagic> On Wed, Aug 03, 2011 at 03:23:19PM +0200, Antoine Pitrou wrote: > This is the same test repeated. Perhaps you meant svn+ssh? oops, thanks for the catch. yes, I did mean svn+ssh. I shall change it. -- Senthil From ethan at stoneleaf.us Wed Aug 3 22:36:00 2011 From: ethan at stoneleaf.us (Ethan Furman) Date: Wed, 03 Aug 2011 13:36:00 -0700 Subject: [Python-Dev] unittest bug Message-ID: <4E39B130.4080504@stoneleaf.us> My apologies for posting here first, but I'm not yet confident enough in my bug searching fu, and duplicates are a pain. Here's the issue: from unittest import * class MyTest(TestCase): def test_add(self): self.assertEqual(1,(2-1),"Sample Subraction Test") if __name__ == '__main__': main() I know this isn't the normal way to use unittest, but since __init__ goes to the trouble of defining __all__ I would think it was supported. 
However, it doesn't work -- I added some print statements to show where the problem lies (in unittest.loader.TestLoader.loadTestsFromModule): ---------------------------------------------------------------------- checking added checking added checking checking added checking checking checking checking checking checking checking None checking None checking 'test_add.py' checking '__main__' checking None checking checking checking checking checking checking checking checking checking checking checking checking checking test = ]>, ]>, ]> --------------------------------------------------------------------- compared with running using the `import unittest` method: --------------------------------------------------------------------- checking added checking checking None checking None checking 'test_add_right.py' checking '__main__' checking None checking test = ]>]> --------------------------------------------------------------------- As you can see, the TestLoader is getting false positives from case.FunctionTestCase and case.TestCase. This a problem because, besides running more tests than it should, this happens: E. ====================================================================== Traceback (most recent call last): File "test_add.py", line 8, in main() File "C:\python32\lib\unittest\main.py", line 125, in __init__ self.runTests() File "C:\python32\lib\unittest\main.py", line 271, in runTests self.result = testRunner.run(self.test) File "C:\python32\lib\unittest\runner.py", line 175, in run result.printErrors() File "C:\python32\lib\unittest\runner.py", line 109, in printErrors self.printErrorList('ERROR', self.errors) File "C:\python32\lib\unittest\runner.py", line 115, in printErrorList self.stream.writeln("%s: %s" % (flavour,self.getDescription(test))) File "C:\python32\lib\unittest\runner.py", line 47, in getDescription return '\n'.join((str(test), doc_first_line)) File "C:\python32\lib\unittest\case.py", line 1246, in __str__ self._testFunc.__name__) AttributeError: 'str' object has no attribute '__name__' I'll be happy to file a bug report if someone can confirm this hasn't already been filed. Thanks for the help! ~Ethan~ PS No, that's not my code. ;) From fuzzyman at voidspace.org.uk Wed Aug 3 23:32:22 2011 From: fuzzyman at voidspace.org.uk (Michael Foord) Date: Wed, 3 Aug 2011 22:32:22 +0100 Subject: [Python-Dev] unittest bug In-Reply-To: <4E39B130.4080504@stoneleaf.us> References: <4E39B130.4080504@stoneleaf.us> Message-ID: <0EFECB4D-72D8-4B71-B3CD-C28B20A662C3@voidspace.org.uk> On 3 Aug 2011, at 21:36, Ethan Furman wrote: > My apologies for posting here first, but I'm not yet confident enough in my bug searching fu, and duplicates are a pain. > > Here's the issue: > > from unittest import * That's the bug right there. Just import TestCase and main and everything should work fine. Using "import *" is not recommended except at the interactive interpreter and it doesn't play well with unittest.main which does magic introspection to find tests to run. Michael > class MyTest(TestCase): > def test_add(self): > self.assertEqual(1,(2-1),"Sample Subraction Test") > > > if __name__ == '__main__': > main() > > I know this isn't the normal way to use unittest, but since __init__ goes to the trouble of defining __all__ I would think it was supported. 
However, it doesn't work -- I added some print statements to show where the problem lies (in unittest.loader.TestLoader.loadTestsFromModule): > > ---------------------------------------------------------------------- > checking > added > checking > added > checking > checking > added > checking > checking > checking > checking > checking > checking > checking None > checking None > checking 'test_add.py' > checking '__main__' > checking None > checking > checking > checking > checking > checking > checking > checking > checking > checking > checking > checking > checking > checking > > test = > ]>, MyTest testMethod=test_add>]>, ]> > --------------------------------------------------------------------- > > compared with running using the `import unittest` method: > --------------------------------------------------------------------- > checking > added > checking > checking None > checking None > checking 'test_add_right.py' > checking '__main__' > checking None > checking > > test = > ]>]> > --------------------------------------------------------------------- > > As you can see, the TestLoader is getting false positives from case.FunctionTestCase and case.TestCase. This a problem because, besides running more tests than it should, this happens: > > E. > ====================================================================== > Traceback (most recent call last): > File "test_add.py", line 8, in > main() > File "C:\python32\lib\unittest\main.py", line 125, in __init__ > self.runTests() > File "C:\python32\lib\unittest\main.py", line 271, in runTests > self.result = testRunner.run(self.test) > File "C:\python32\lib\unittest\runner.py", line 175, in run > result.printErrors() > File "C:\python32\lib\unittest\runner.py", line 109, in printErrors > self.printErrorList('ERROR', self.errors) > File "C:\python32\lib\unittest\runner.py", line 115, in printErrorList > self.stream.writeln("%s: %s" % (flavour,self.getDescription(test))) > File "C:\python32\lib\unittest\runner.py", line 47, in getDescription > return '\n'.join((str(test), doc_first_line)) > File "C:\python32\lib\unittest\case.py", line 1246, in __str__ > self._testFunc.__name__) > AttributeError: 'str' object has no attribute '__name__' > > I'll be happy to file a bug report if someone can confirm this hasn't already been filed. > > Thanks for the help! > > ~Ethan~ > > PS > No, that's not my code. ;) > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.uk > -- http://www.voidspace.org.uk/ May you do good and not evil May you find forgiveness for yourself and forgive others May you share freely, never taking more than you give. -- the sqlite blessing http://www.sqlite.org/different.html From ethan at stoneleaf.us Wed Aug 3 23:58:31 2011 From: ethan at stoneleaf.us (Ethan Furman) Date: Wed, 03 Aug 2011 14:58:31 -0700 Subject: [Python-Dev] unittest bug In-Reply-To: <0EFECB4D-72D8-4B71-B3CD-C28B20A662C3@voidspace.org.uk> References: <4E39B130.4080504@stoneleaf.us> <0EFECB4D-72D8-4B71-B3CD-C28B20A662C3@voidspace.org.uk> Message-ID: <4E39C487.9090403@stoneleaf.us> Michael Foord wrote: > On 3 Aug 2011, at 21:36, Ethan Furman wrote: >> My apologies for posting here first, but I'm not yet confident enough in my bug searching fu, and duplicates are a pain. >> >> Here's the issue: >> >> from unittest import * > > That's the bug right there. 
Just import TestCase and main and everything should work fine. Using "import *" is not recommended except at the interactive interpreter and it doesn't play well with unittest.main which does magic introspection to find tests to run. If from xxx import * is not supported, why provide __all__? At the very least the lack of a warning is a documentation bug. ~Ethan~ From fuzzyman at voidspace.org.uk Wed Aug 3 23:44:50 2011 From: fuzzyman at voidspace.org.uk (Michael Foord) Date: Wed, 3 Aug 2011 22:44:50 +0100 Subject: [Python-Dev] unittest bug In-Reply-To: <4E39C487.9090403@stoneleaf.us> References: <4E39B130.4080504@stoneleaf.us> <0EFECB4D-72D8-4B71-B3CD-C28B20A662C3@voidspace.org.uk> <4E39C487.9090403@stoneleaf.us> Message-ID: On 3 Aug 2011, at 22:58, Ethan Furman wrote: > Michael Foord wrote: >> On 3 Aug 2011, at 21:36, Ethan Furman wrote: >>> My apologies for posting here first, but I'm not yet confident enough in my bug searching fu, and duplicates are a pain. >>> >>> Here's the issue: >>> >>> from unittest import * >> That's the bug right there. Just import TestCase and main and everything should work fine. Using "import *" is not recommended except at the interactive interpreter and it doesn't play well with unittest.main which does magic introspection to find tests to run. > > If from xxx import * is not supported, why provide __all__? a) to define the public API b) to limit the symbols exported - that is not the same as having main(?) work with import *, they're orthogonal > At the very least the lack of a warning is a documentation bug. > Feel free to propose a patch fixing that problem (on the issue tracker please). All the best, Michael Foord > ~Ethan~ > -- http://www.voidspace.org.uk/ May you do good and not evil May you find forgiveness for yourself and forgive others May you share freely, never taking more than you give. -- the sqlite blessing http://www.sqlite.org/different.html From ethan at stoneleaf.us Thu Aug 4 01:00:52 2011 From: ethan at stoneleaf.us (Ethan Furman) Date: Wed, 03 Aug 2011 16:00:52 -0700 Subject: [Python-Dev] unittest bug In-Reply-To: References: <4E39B130.4080504@stoneleaf.us> <0EFECB4D-72D8-4B71-B3CD-C28B20A662C3@voidspace.org.uk> <4E39C487.9090403@stoneleaf.us> Message-ID: <4E39D324.6050600@stoneleaf.us> Michael Foord wrote: > On 3 Aug 2011, at 22:58, Ethan Furman wrote: >> Michael Foord wrote: >>> On 3 Aug 2011, at 21:36, Ethan Furman wrote: >>>> My apologies for posting here first, but I'm not yet confident enough in my bug searching fu, and duplicates are a pain. >>>> >>>> Here's the issue: >>>> >>>> from unittest import * >>> >>> That's the bug right there. Just import TestCase and main and everything should work fine. Using "import *" is not recommended except at the interactive interpreter and it doesn't play well with unittest.main which does magic introspection to find tests to run. >> >> If from xxx import * is not supported, why provide __all__? > > a) to define the public API In trying to refute this, I found http://docs.python.org/py3k/reference/simple_stmts.html?highlight=__all__#the-import-statement and learned something new. Thanks, Michael! I think I'll withdraw my bug report, however -- since `from ... import *` is already noted as usually bad practice it should fall on the shoulders of the modules where it is /supposed/ to work to advertise that, and absent any such advertisement it should not be used. 
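For the record, the interaction is easy to see without running any tests: unittest.main() has the loader scan the calling module for TestCase subclasses, and after a star import the base classes themselves are sitting in that namespace. A quick sketch (simplified, not the loader's actual code):

    import unittest

    ns = {}
    exec('from unittest import *', ns)

    # Roughly the check loadTestsFromModule() applies to every module attribute.
    picked_up = sorted(name for name, obj in ns.items()
                       if isinstance(obj, type) and issubclass(obj, unittest.TestCase))
    print(picked_up)   # ['FunctionTestCase', 'TestCase'] -- in a real test module, your own cases too

which lines up with the three suites showing up in the loader output earlier in the thread instead of one.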
~Ethan~ From g.brandl at gmx.net Thu Aug 4 07:42:22 2011 From: g.brandl at gmx.net (Georg Brandl) Date: Thu, 04 Aug 2011 07:42:22 +0200 Subject: [Python-Dev] Daily reference leaks (65c412586901): sum=0 In-Reply-To: References: Message-ID: Am 04.08.2011 05:25, schrieb solipsis at pitrou.net: > results for 65c412586901 on branch "default" > -------------------------------------------- > > > > Command line was: ['./python', '-m', 'test.regrtest', '-uall', '-R', > '3:3:/home/antoine/cpython/refleaks/reflogso7nu3', '-x'] Do we need this mail even if there are no leaks to report? Georg From ncoghlan at gmail.com Thu Aug 4 07:54:54 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 4 Aug 2011 15:54:54 +1000 Subject: [Python-Dev] Daily reference leaks (65c412586901): sum=0 In-Reply-To: References: Message-ID: On Thu, Aug 4, 2011 at 3:42 PM, Georg Brandl wrote: > Am 04.08.2011 05:25, schrieb solipsis at pitrou.net: >> results for 65c412586901 on branch "default" >> -------------------------------------------- >> >> >> >> Command line was: ['./python', '-m', 'test.regrtest', '-uall', '-R', >> '3:3:/home/antoine/cpython/refleaks/reflogso7nu3', '-x'] > > Do we need this mail even if there are no leaks to report? I find it useful in order to tell the difference between "no leaks to report" and "refleak checking job is no longer running" Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From solipsis at pitrou.net Thu Aug 4 13:12:00 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 4 Aug 2011 13:12:00 +0200 Subject: [Python-Dev] Daily reference leaks (65c412586901): sum=0 References: Message-ID: <20110804131200.026acdc0@pitrou.net> On Thu, 4 Aug 2011 15:54:54 +1000 Nick Coghlan wrote: > On Thu, Aug 4, 2011 at 3:42 PM, Georg Brandl wrote: > > Am 04.08.2011 05:25, schrieb solipsis at pitrou.net: > >> results for 65c412586901 on branch "default" > >> -------------------------------------------- > >> > >> > >> > >> Command line was: ['./python', '-m', 'test.regrtest', '-uall', '-R', > >> '3:3:/home/antoine/cpython/refleaks/reflogso7nu3', '-x'] > > > > Do we need this mail even if there are no leaks to report? > > I find it useful in order to tell the difference between "no leaks to > report" and "refleak checking job is no longer running" That's exactly why I'm sending it every day :) Regards Antoine. From rpjday at crashcourse.ca Fri Aug 5 15:01:01 2011 From: rpjday at crashcourse.ca (Robert P. J. Day) Date: Fri, 5 Aug 2011 09:01:01 -0400 (EDT) Subject: [Python-Dev] what is the significance of "plat-linux2" in the python build process? Message-ID: (note: i'm not a python dev subscriber so please make sure you CC me with any advice and i'm hoping this desperate plea for assistance is at least enough on-point for this list that someone can help me out. and, yes, this is rather verbose but i wanted to supply all of the relevant details in one shot.) i asked about this on the general python help list but i suspect it's more appropriate to ask developers about this and i'm hoping someone can clear this up for me. i'm building an embedded system using wind river linux 4.2 (WRL 4.2), and part of that build process involves downloading, patching and compiling python 2.6.2 for the eventual target filesystem. this is all being done on my 64-bit ubuntu 11.04 system and, to keep things simple, i'm not even cross-compiling, i've selected a common 64-bit PC as the target, so i should be able to ignore any cross compile-related glitches i might have had. 
this build process works just fine for everyone else on the planet but it fails for me because i'm doing something apparently no one else has tried -- i'm running a (hand-rolled) linux 3.x kernel on my build host and it *seems* that that's what's messing up the python compilation somewhere in the WRL build scripts. (as a side note, i have run across other issues in the WRL build system where it was simply never imagined that one might want to build on a system running a 3.x kernel, so that's why i'm suspecting it has something to do with that. apparently, i'm out there on the bleeding edge and this is what might be causing me grief.) the symptom seems to be that there is confusion in the python build process between two directories: "plat-linux2" and "plat-linux3". my first simple question would be: what do those names represent and how should they appear in the build process? as a benchmark, i downloaded an *absolutely stock* Python-2.6.2 tarball, untarred it, ran "./configure", then searched for any references to those strings just so i could have a basis for comparison. so, immediately after the configure, here's what i found for the stock 2.6.2 python tarball: $ find . -name "plat-linux*" ./Lib/plat-linux2 $ $ grep -r plat-linux * Doc/install/index.rst: ['', '/usr/local/lib/python2.3', '/usr/local/lib/python2.3/plat-linux2', Doc/install/index.rst:'/www/python/lib/pythonX.Y/plat-linux2', ...]``. Misc/RPM/python-2.6.spec:%{__prefix}/%{libdirname}/python%{libvers}/plat-linux2 Misc/HISTORY: * Lib/plat-sunos5/CDIO.py, Lib/plat-linux2/CDROM.py: Misc/HISTORY:e.g. lib-tk, lib-stdwin, plat-win, plat-linux2, plat-sunos5, dos-8x3. $ so, before the build, there are a few references to plat-linux2 and *none* to plat-linux3. i then ran "make" (which seemed to work just fine) and here's the result of a similar search after the make: $ find . -name "plat-linux*" ./Lib/plat-linux2 $ so that's still the same after the build, there is no plat-linux3 file or directory that's been created. however, if i do a recursive grep: $ grep -r plat-linux3 * Binary file libpython2.6.a matches Binary file Modules/getpath.o matches Binary file python matches $ then it's clear that the string "plat-linux3" is now embedded in a small number of the build results. so what does that string represent? why is it there and what does "2" mean compared to "3" in this context? and, most importantly, even though it's there, it didn't stop the build from completing. now take a look at the tail end of the output of the WRL build of python, where things go wrong (what is clearly a packaging step so i'm well aware that this is somewhat outside the scope of normal python building): ===== begin output ===== ... snip ... 
Checking for unpackaged file(s): /home/rpjday/WindRiver/projects/42/python/host-cross/bin/../lib64/rpm/check-files /home/rpjday/WindRiver/projects/42/python/build/INSTALL_STAGE/python-2.6.2 error: Installed (but unpackaged) file(s) found: /usr/lib64/python2.6/plat-linux3/IN.py /usr/lib64/python2.6/plat-linux3/IN.pyc /usr/lib64/python2.6/plat-linux3/IN.pyo /usr/lib64/python2.6/plat-linux3/regen RPM build errors: File not found: /home/rpjday/WindRiver/projects/42/python/build/INSTALL_STAGE/python-2.6.2/usr/lib64/python2.6/plat-linux2 Installed (but unpackaged) file(s) found: /usr/lib64/python2.6/plat-linux3/IN.py /usr/lib64/python2.6/plat-linux3/IN.pyc /usr/lib64/python2.6/plat-linux3/IN.pyo /usr/lib64/python2.6/plat-linux3/regen /home/rpjday/WindRiver/projects/42/python/scripts/packages.mk:2661: *** [python.install] Error 1 /home/rpjday/WindRiver/projects/42/python/scripts/packages.mk:3017: *** [python.buildlogger] Error 2 /home/rpjday/WindRiver/projects/42/python/scripts/packages.mk:3225: *** [python.build] Error 2 make: *** [python] Error 2 make: Leaving directory `/home/rpjday/WindRiver/projects/42/python/build' ===== end output ===== note the obvious reference to this "plat-linux3" directory that appears out of nowhere that never existed in the stock build. and if i wander down to the WRL python build directory: $ find . -name plat-linux* ./Lib/plat-linux2 ./Lib/plat-linux3 $ $ find Lib/plat-linux[23] Lib/plat-linux2 Lib/plat-linux2/CDROM.py Lib/plat-linux2/DLFCN.py Lib/plat-linux2/IN.py Lib/plat-linux2/regen Lib/plat-linux2/TYPES.py Lib/plat-linux3 Lib/plat-linux3/IN.py Lib/plat-linux3/regen $ and this is as far as i got before confusion set in. it seems that the WRL build process is getting confused about which of those two values to use for the build and ends up scattering generated artifacts across both directories, at which point the packaging step gets confused and gives up. if anyone can clarify what might be going on here and what *should* be happening, i'd be grateful. i realize i'm asking for remote diagnosis on a proprietary build system, i'm just wondering what a native python build *should* look like and what it should produce. thanks muchly for any guidance. rday -- ======================================================================== Robert P. J. Day Ottawa, Ontario, CANADA http://crashcourse.ca Twitter: http://twitter.com/rpjday LinkedIn: http://ca.linkedin.com/in/rpjday ======================================================================== From solipsis at pitrou.net Fri Aug 5 15:31:09 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 5 Aug 2011 15:31:09 +0200 Subject: [Python-Dev] what is the significance of "plat-linux2" in the python build process? References: Message-ID: <20110805153109.79c181b9@pitrou.net> On Fri, 5 Aug 2011 09:01:01 -0400 (EDT) "Robert P. J. Day" wrote: > > this build process works just fine for everyone else on the planet > but it fails for me because i'm doing something apparently no one else > has tried -- i'm running a (hand-rolled) linux 3.x kernel on my build > host and it *seems* that that's what's messing up the python > compilation somewhere in the WRL build scripts. (as a side note, i > have run across other issues in the WRL build system where it was > simply never imagined that one might want to build on a system running > a 3.x kernel, so that's why i'm suspecting it has something to do with > that. apparently, i'm out there on the bleeding edge and this is what > might be causing me grief.) 
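If I'm reading the build machinery right (a sketch of an explanation, not a WRL diagnosis): the plat-<name> directory is derived from the same platform string as sys.platform, and a 2.6 tree configured on a 3.x kernel identifies itself as "linux3" rather than "linux2", so getpath bakes plat-linux3 into the module search path while the source tree and the packaging lists still say plat-linux2. A quick check with the interpreter the build produces (output shown is only an example):

    import sys

    print(sys.platform)   # 'linux2' for a build configured on a 2.x kernel, 'linux3' on 3.x
    print([p for p in sys.path if 'plat-' in p])
    # e.g. ['/usr/lib64/python2.6/plat-linux3'] -- the directory RPM complains about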
You could take a look at http://bugs.python.org/issue12326 The current 2.7 branch should work for you, you'll have to get it from the Mercurial repository. That says, the plat-* stuff is quite useless as it is. There's a patch here to improve its usefulness slightly: http://bugs.python.org/issue12619 Although I would question the existence of such undocumented modules, which are hardly even used internally. Regards Antoine. From status at bugs.python.org Fri Aug 5 18:07:26 2011 From: status at bugs.python.org (Python tracker) Date: Fri, 5 Aug 2011 18:07:26 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20110805160726.BD8A21D084@psf.upfronthosting.co.za> ACTIVITY SUMMARY (2011-07-29 - 2011-08-05) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue. Do NOT respond to this message. Issues counts and deltas: open 2899 (+10) closed 21579 (+32) total 24478 (+42) Open issues with patches: 1255 Issues opened (27) ================== #1813: Codec lookup failing under turkish locale http://bugs.python.org/issue1813 reopened by skrah #9723: Add shlex.quote http://bugs.python.org/issue9723 reopened by eric.araujo #12656: test.test_asyncore: add tests for AF_INET6 and AF_UNIX sockets http://bugs.python.org/issue12656 opened by neologix #12657: Cannot override JSON encoding of basic type subclasses http://bugs.python.org/issue12657 opened by barry #12659: Add tests for packaging.tests.support http://bugs.python.org/issue12659 opened by eric.araujo #12660: test_gdb fails when installed http://bugs.python.org/issue12660 opened by pitrou #12661: Add a new shutil.cleartree function to shutil module http://bugs.python.org/issue12661 opened by chin #12662: Allow configparser to process suplicate options http://bugs.python.org/issue12662 opened by ojab #12666: map semantic change not documented in What's New http://bugs.python.org/issue12666 opened by jason.coombs #12668: 3.2 What's New: it's integer->string, not the opposite http://bugs.python.org/issue12668 opened by sandro.tosi #12669: test_curses skipped on buildbots http://bugs.python.org/issue12669 opened by nadeem.vawda #12672: Some problems in documentation extending/newtypes.html http://bugs.python.org/issue12672 opened by eli.bendersky #12675: tokenize module happily tokenizes code with syntax errors http://bugs.python.org/issue12675 opened by Gareth.Rees #12677: Turtle, fix right/left rotation orientation http://bugs.python.org/issue12677 opened by sandro.tosi #12678: test_packaging and test_distutils failures under Windows http://bugs.python.org/issue12678 opened by pitrou #12680: cPickle.loads is not thread safe due to non-thread-safe import http://bugs.python.org/issue12680 opened by Sagiv.Malihi #12681: unittest expectedFailure could take a message argument like sk http://bugs.python.org/issue12681 opened by r.david.murray #12682: Meaning of 'accepted' resolution as documented in devguide http://bugs.python.org/issue12682 opened by r.david.murray #12684: profile does not dump stats on exception like cProfile does http://bugs.python.org/issue12684 opened by anacrolix #12686: argparse - document (and improve?) 
use of SUPPRESS with help= http://bugs.python.org/issue12686 opened by derks #12687: Python 3.2 fails to load protocol 0 pickle http://bugs.python.org/issue12687 opened by vinay.sajip #12690: Tix bug 2643483 http://bugs.python.org/issue12690 opened by Gary.Levin #12691: tokenize.untokenize is broken http://bugs.python.org/issue12691 opened by Gareth.Rees #12692: test_urllib2net is triggering a ResourceWarning http://bugs.python.org/issue12692 opened by brett.cannon #12693: test.support.transient_internet prints to stderr when verbose http://bugs.python.org/issue12693 opened by brett.cannon #12694: crlf.py script from Tools doesn't work with Python 3.2 http://bugs.python.org/issue12694 opened by bialix #12696: pydoc error page due to lacking permissions on ./* http://bugs.python.org/issue12696 opened by gagern Most recent 15 issues with no replies (15) ========================================== #12696: pydoc error page due to lacking permissions on ./* http://bugs.python.org/issue12696 #12694: crlf.py script from Tools doesn't work with Python 3.2 http://bugs.python.org/issue12694 #12684: profile does not dump stats on exception like cProfile does http://bugs.python.org/issue12684 #12672: Some problems in documentation extending/newtypes.html http://bugs.python.org/issue12672 #12668: 3.2 What's New: it's integer->string, not the opposite http://bugs.python.org/issue12668 #12662: Allow configparser to process suplicate options http://bugs.python.org/issue12662 #12660: test_gdb fails when installed http://bugs.python.org/issue12660 #12659: Add tests for packaging.tests.support http://bugs.python.org/issue12659 #12657: Cannot override JSON encoding of basic type subclasses http://bugs.python.org/issue12657 #12656: test.test_asyncore: add tests for AF_INET6 and AF_UNIX sockets http://bugs.python.org/issue12656 #12653: Provide accelerators for all buttons in Windows installers http://bugs.python.org/issue12653 #12645: test.support. 
import_fresh_module - incorrect doc http://bugs.python.org/issue12645 #12639: msilib Directory.start_component() fails if keyfile is not Non http://bugs.python.org/issue12639 #12623: "universal newlines" subprocess support broken with select- an http://bugs.python.org/issue12623 #12622: failfast argument to TextTestRunner not documented http://bugs.python.org/issue12622 Most recent 15 issues waiting for review (15) ============================================= #12684: profile does not dump stats on exception like cProfile does http://bugs.python.org/issue12684 #12677: Turtle, fix right/left rotation orientation http://bugs.python.org/issue12677 #12668: 3.2 What's New: it's integer->string, not the opposite http://bugs.python.org/issue12668 #12661: Add a new shutil.cleartree function to shutil module http://bugs.python.org/issue12661 #12656: test.test_asyncore: add tests for AF_INET6 and AF_UNIX sockets http://bugs.python.org/issue12656 #12652: Keep test.support docs out of the global docs index http://bugs.python.org/issue12652 #12650: Subprocess leaks fd upon kill() http://bugs.python.org/issue12650 #12646: zlib.Decompress.decompress/flush do not raise any exceptions w http://bugs.python.org/issue12646 #12639: msilib Directory.start_component() fails if keyfile is not Non http://bugs.python.org/issue12639 #12633: sys.modules doc entry should reflect restrictions http://bugs.python.org/issue12633 #12627: Implement PEP 394: The "python" Command on Unix-Like Systems http://bugs.python.org/issue12627 #12625: sporadic test_unittest failure http://bugs.python.org/issue12625 #12619: Automatically regenerate platform-specific modules http://bugs.python.org/issue12619 #12618: py_compile cannot create files in current directory http://bugs.python.org/issue12618 #12614: Allow to explicitly set the method of urllib.request.Request http://bugs.python.org/issue12614 Top 10 most discussed issues (10) ================================= #12675: tokenize module happily tokenizes code with syntax errors http://bugs.python.org/issue12675 8 msgs #11049: add tests for test.support http://bugs.python.org/issue11049 7 msgs #7424: segmentation fault in listextend during install http://bugs.python.org/issue7424 6 msgs #11572: bring Lib/copy.py to 100% coverage http://bugs.python.org/issue11572 6 msgs #12648: Wrong import module search order on Windows http://bugs.python.org/issue12648 6 msgs #12682: Meaning of 'accepted' resolution as documented in devguide http://bugs.python.org/issue12682 6 msgs #1813: Codec lookup failing under turkish locale http://bugs.python.org/issue1813 5 msgs #12652: Keep test.support docs out of the global docs index http://bugs.python.org/issue12652 5 msgs #8639: Allow callable objects in inspect.getfullargspec http://bugs.python.org/issue8639 4 msgs #9968: cgi.FieldStorage: Give control about the directory used for up http://bugs.python.org/issue9968 4 msgs Issues closed (33) ================== #9788: atexit and execution order http://bugs.python.org/issue9788 closed by eric.araujo #11104: distutils sdist ignores MANIFEST http://bugs.python.org/issue11104 closed by eric.araujo #11281: smtplib: add ability to bind to specific source IP address/por http://bugs.python.org/issue11281 closed by python-dev #11651: Improve test targets in Makefile http://bugs.python.org/issue11651 closed by nadeem.vawda #11699: Doc for optparse.OptionParser.get_option_group is wrong http://bugs.python.org/issue11699 closed by eli.bendersky #11933: newer() function in dep_util.py mixes up new vs. 
old files due http://bugs.python.org/issue11933 closed by eric.araujo #12183: Document behaviour of shutil.copy2 and copystat with symlinks http://bugs.python.org/issue12183 closed by python-dev #12295: Fix ResourceWarning in turtledemo help window http://bugs.python.org/issue12295 closed by eric.araujo #12331: lib2to3 and packaging tests fail because they write into prote http://bugs.python.org/issue12331 closed by eric.araujo #12464: tempfile.TemporaryDirectory.cleanup follows symbolic links http://bugs.python.org/issue12464 closed by neologix #12531: documentation index entries for * and ** http://bugs.python.org/issue12531 closed by eli.bendersky #12540: "Restart Shell" command leaves pythonw.exe processes running http://bugs.python.org/issue12540 closed by ned.deily #12562: calling mmap twice fails on Windows http://bugs.python.org/issue12562 closed by pitrou #12626: run test cases based on a glob filter http://bugs.python.org/issue12626 closed by pitrou #12631: Mutable Sequence Type in .remove() is consistent only with lis http://bugs.python.org/issue12631 closed by petri.lehtinen #12654: sum() works with bytes objects http://bugs.python.org/issue12654 closed by benjamin.peterson #12655: Expose sched.h functions http://bugs.python.org/issue12655 closed by python-dev #12658: Build fails in a non-checkout directory http://bugs.python.org/issue12658 closed by pitrou #12663: ArgumentParser.error writes to stderr not to stdout http://bugs.python.org/issue12663 closed by python-dev #12664: Path variable - Windows installer http://bugs.python.org/issue12664 closed by r.david.murray #12665: Dictionary view example has error in set ops http://bugs.python.org/issue12665 closed by sandro.tosi #12667: Better logging.handler.SMTPHandler doc for 'secure' argument http://bugs.python.org/issue12667 closed by python-dev #12670: Fix struct code after forward declaration on ctypes doc http://bugs.python.org/issue12670 closed by sandro.tosi #12671: urlopen returning empty string http://bugs.python.org/issue12671 closed by mrabarnett #12673: SEGFAULT error on OpenBSD (sparc) http://bugs.python.org/issue12673 closed by r.david.murray #12674: pydoc str.split does not find the method http://bugs.python.org/issue12674 closed by r.david.murray #12676: Bug in http.client http://bugs.python.org/issue12676 closed by python-dev #12679: ThreadError is not in threading.__all__ http://bugs.python.org/issue12679 closed by python-dev #12683: urlparse.urljoin different behavior for different scheme http://bugs.python.org/issue12683 closed by python-dev #12685: The backslash escape doesn't concatenate two strings in one in http://bugs.python.org/issue12685 closed by benjamin.peterson #12688: ConfigParser.__init__(iterpolation=None) documentation != beha http://bugs.python.org/issue12688 closed by lukasz.langa #12689: IDLE crashes after pressing ctrl+space http://bugs.python.org/issue12689 closed by r.david.murray #12695: subprocess.Popen: OSError: [Errno 9] Bad file descriptor http://bugs.python.org/issue12695 closed by gagern From chris at simplistix.co.uk Fri Aug 5 19:35:56 2011 From: chris at simplistix.co.uk (Chris Withers) Date: Fri, 05 Aug 2011 18:35:56 +0100 Subject: [Python-Dev] cpython (2.7): note Ellipsis syntax In-Reply-To: References: Message-ID: <4E3C29FC.8090501@simplistix.co.uk> On 31/07/2011 07:47, Raymond Hettinger wrote: > > It's really nice for stub functions: > > def foo(x): > ... I guess pass is too pass-?? 
;-) Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From jimjjewett at gmail.com Fri Aug 5 23:55:33 2011 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 5 Aug 2011 17:55:33 -0400 Subject: [Python-Dev] [Python-checkins] cpython: #11572: improvements to copy module tests along with removal of old test suite In-Reply-To: References: Message-ID: Why was the old test suite removed? Even if everything is covered by the test file (and that isn't clear from this checkin), I don't see anything wrong with a quick test that doesn't require loading the whole testing apparatus. (I would have no objection to including a comment saying that the majority of the tests are in the test file; I just wonder why they have to be removed entirely.) On Fri, Aug 5, 2011 at 5:06 PM, sandro.tosi wrote: > http://hg.python.org/cpython/rev/74e79b2c114a > changeset: ? 71749:74e79b2c114a > user: ? ? ? ?Sandro Tosi > date: ? ? ? ?Fri Aug 05 23:05:35 2011 +0200 > summary: > ?#11572: improvements to copy module tests along with removal of old test suite > > files: > ?Lib/copy.py ? ? ? ? ? | ? 65 ----------- > ?Lib/test/test_copy.py | ?168 ++++++++++++++++------------- > ?2 files changed, 95 insertions(+), 138 deletions(-) > > > diff --git a/Lib/copy.py b/Lib/copy.py > --- a/Lib/copy.py > +++ b/Lib/copy.py > @@ -323,68 +323,3 @@ > ?# Helper for instance creation without calling __init__ > ?class _EmptyClass: > ? ? pass > - > -def _test(): > - ? ?l = [None, 1, 2, 3.14, 'xyzzy', (1, 2), [3.14, 'abc'], > - ? ? ? ? {'abc': 'ABC'}, (), [], {}] > - ? ?l1 = copy(l) > - ? ?print(l1==l) > - ? ?l1 = map(copy, l) > - ? ?print(l1==l) > - ? ?l1 = deepcopy(l) > - ? ?print(l1==l) > - ? ?class C: > - ? ? ? ?def __init__(self, arg=None): > - ? ? ? ? ? ?self.a = 1 > - ? ? ? ? ? ?self.arg = arg > - ? ? ? ? ? ?if __name__ == '__main__': > - ? ? ? ? ? ? ? ?import sys > - ? ? ? ? ? ? ? ?file = sys.argv[0] > - ? ? ? ? ? ?else: > - ? ? ? ? ? ? ? ?file = __file__ > - ? ? ? ? ? ?self.fp = open(file) > - ? ? ? ? ? ?self.fp.close() > - ? ? ? ?def __getstate__(self): > - ? ? ? ? ? ?return {'a': self.a, 'arg': self.arg} > - ? ? ? ?def __setstate__(self, state): > - ? ? ? ? ? ?for key, value in state.items(): > - ? ? ? ? ? ? ? ?setattr(self, key, value) > - ? ? ? ?def __deepcopy__(self, memo=None): > - ? ? ? ? ? ?new = self.__class__(deepcopy(self.arg, memo)) > - ? ? ? ? ? ?new.a = self.a > - ? ? ? ? ? ?return new > - ? ?c = C('argument sketch') > - ? ?l.append(c) > - ? ?l2 = copy(l) > - ? ?print(l == l2) > - ? ?print(l) > - ? ?print(l2) > - ? ?l2 = deepcopy(l) > - ? ?print(l == l2) > - ? ?print(l) > - ? ?print(l2) > - ? ?l.append({l[1]: l, 'xyz': l[2]}) > - ? ?l3 = copy(l) > - ? ?import reprlib > - ? ?print(map(reprlib.repr, l)) > - ? ?print(map(reprlib.repr, l1)) > - ? ?print(map(reprlib.repr, l2)) > - ? ?print(map(reprlib.repr, l3)) > - ? ?l3 = deepcopy(l) > - ? ?print(map(reprlib.repr, l)) > - ? ?print(map(reprlib.repr, l1)) > - ? ?print(map(reprlib.repr, l2)) > - ? ?print(map(reprlib.repr, l3)) > - ? ?class odict(dict): > - ? ? ? ?def __init__(self, d = {}): > - ? ? ? ? ? ?self.a = 99 > - ? ? ? ? ? ?dict.__init__(self, d) > - ? ? ? ?def __setitem__(self, k, i): > - ? ? ? ? ? ?dict.__setitem__(self, k, i) > - ? ? ? ? ? ?self.a > - ? ?o = odict({"A" : "B"}) > - ? ?x = deepcopy(o) > - ? ?print(o, x) > - > -if __name__ == '__main__': > - ? 
?_test() > diff --git a/Lib/test/test_copy.py b/Lib/test/test_copy.py > --- a/Lib/test/test_copy.py > +++ b/Lib/test/test_copy.py > @@ -17,7 +17,7 @@ > ? ? # Attempt full line coverage of copy.py from top to bottom > > ? ? def test_exceptions(self): > - ? ? ? ?self.assertTrue(copy.Error is copy.error) > + ? ? ? ?self.assertIs(copy.Error, copy.error) > ? ? ? ? self.assertTrue(issubclass(copy.Error, Exception)) > > ? ? # The copy() method > @@ -54,20 +54,26 @@ > ? ? def test_copy_reduce_ex(self): > ? ? ? ? class C(object): > ? ? ? ? ? ? def __reduce_ex__(self, proto): > + ? ? ? ? ? ? ? ?c.append(1) > ? ? ? ? ? ? ? ? return "" > ? ? ? ? ? ? def __reduce__(self): > - ? ? ? ? ? ? ? ?raise support.TestFailed("shouldn't call this") > + ? ? ? ? ? ? ? ?self.fail("shouldn't call this") > + ? ? ? ?c = [] > ? ? ? ? x = C() > ? ? ? ? y = copy.copy(x) > - ? ? ? ?self.assertTrue(y is x) > + ? ? ? ?self.assertIs(y, x) > + ? ? ? ?self.assertEqual(c, [1]) > > ? ? def test_copy_reduce(self): > ? ? ? ? class C(object): > ? ? ? ? ? ? def __reduce__(self): > + ? ? ? ? ? ? ? ?c.append(1) > ? ? ? ? ? ? ? ? return "" > + ? ? ? ?c = [] > ? ? ? ? x = C() > ? ? ? ? y = copy.copy(x) > - ? ? ? ?self.assertTrue(y is x) > + ? ? ? ?self.assertIs(y, x) > + ? ? ? ?self.assertEqual(c, [1]) > > ? ? def test_copy_cant(self): > ? ? ? ? class C(object): > @@ -91,7 +97,7 @@ > ? ? ? ? ? ? ? ? ?"hello", "hello\u1234", f.__code__, > ? ? ? ? ? ? ? ? ?NewStyle, range(10), Classic, max] > ? ? ? ? for x in tests: > - ? ? ? ? ? ?self.assertTrue(copy.copy(x) is x, repr(x)) > + ? ? ? ? ? ?self.assertIs(copy.copy(x), x) > > ? ? def test_copy_list(self): > ? ? ? ? x = [1, 2, 3] > @@ -185,9 +191,9 @@ > ? ? ? ? x = [x, x] > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y[0] is not x[0]) > - ? ? ? ?self.assertTrue(y[0] is y[1]) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIsNot(y[0], x[0]) > + ? ? ? ?self.assertIs(y[0], y[1]) > > ? ? def test_deepcopy_issubclass(self): > ? ? ? ? # XXX Note: there's no way to test the TypeError coming out of > @@ -227,20 +233,26 @@ > ? ? def test_deepcopy_reduce_ex(self): > ? ? ? ? class C(object): > ? ? ? ? ? ? def __reduce_ex__(self, proto): > + ? ? ? ? ? ? ? ?c.append(1) > ? ? ? ? ? ? ? ? return "" > ? ? ? ? ? ? def __reduce__(self): > - ? ? ? ? ? ? ? ?raise support.TestFailed("shouldn't call this") > + ? ? ? ? ? ? ? ?self.fail("shouldn't call this") > + ? ? ? ?c = [] > ? ? ? ? x = C() > ? ? ? ? y = copy.deepcopy(x) > - ? ? ? ?self.assertTrue(y is x) > + ? ? ? ?self.assertIs(y, x) > + ? ? ? ?self.assertEqual(c, [1]) > > ? ? def test_deepcopy_reduce(self): > ? ? ? ? class C(object): > ? ? ? ? ? ? def __reduce__(self): > + ? ? ? ? ? ? ? ?c.append(1) > ? ? ? ? ? ? ? ? return "" > + ? ? ? ?c = [] > ? ? ? ? x = C() > ? ? ? ? y = copy.deepcopy(x) > - ? ? ? ?self.assertTrue(y is x) > + ? ? ? ?self.assertIs(y, x) > + ? ? ? ?self.assertEqual(c, [1]) > > ? ? def test_deepcopy_cant(self): > ? ? ? ? class C(object): > @@ -264,14 +276,14 @@ > ? ? ? ? ? ? ? ? ?"hello", "hello\u1234", f.__code__, > ? ? ? ? ? ? ? ? ?NewStyle, range(10), Classic, max] > ? ? ? ? for x in tests: > - ? ? ? ? ? ?self.assertTrue(copy.deepcopy(x) is x, repr(x)) > + ? ? ? ? ? ?self.assertIs(copy.deepcopy(x), x) > > ? ? def test_deepcopy_list(self): > ? ? ? ? x = [[1, 2], 3] > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(x is not y) > - ? ? ? ?self.assertTrue(x[0] is not y[0]) > + ? ? ? ?self.assertIsNot(x, y) > + ? 
? ? ?self.assertIsNot(x[0], y[0]) > > ? ? def test_deepcopy_reflexive_list(self): > ? ? ? ? x = [] > @@ -279,16 +291,26 @@ > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? for op in comparisons: > ? ? ? ? ? ? self.assertRaises(RuntimeError, op, y, x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y[0] is y) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIs(y[0], y) > ? ? ? ? self.assertEqual(len(y), 1) > > + ? ?def test_deepcopy_empty_tuple(self): > + ? ? ? ?x = () > + ? ? ? ?y = copy.deepcopy(x) > + ? ? ? ?self.assertIs(x, y) > + > ? ? def test_deepcopy_tuple(self): > ? ? ? ? x = ([1, 2], 3) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(x is not y) > - ? ? ? ?self.assertTrue(x[0] is not y[0]) > + ? ? ? ?self.assertIsNot(x, y) > + ? ? ? ?self.assertIsNot(x[0], y[0]) > + > + ? ?def test_deepcopy_tuple_of_immutables(self): > + ? ? ? ?x = ((1, 2), 3) > + ? ? ? ?y = copy.deepcopy(x) > + ? ? ? ?self.assertIs(x, y) > > ? ? def test_deepcopy_reflexive_tuple(self): > ? ? ? ? x = ([],) > @@ -296,16 +318,16 @@ > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? for op in comparisons: > ? ? ? ? ? ? self.assertRaises(RuntimeError, op, y, x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y[0] is not x[0]) > - ? ? ? ?self.assertTrue(y[0][0] is y) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIsNot(y[0], x[0]) > + ? ? ? ?self.assertIs(y[0][0], y) > > ? ? def test_deepcopy_dict(self): > ? ? ? ? x = {"foo": [1, 2], "bar": 3} > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(x is not y) > - ? ? ? ?self.assertTrue(x["foo"] is not y["foo"]) > + ? ? ? ?self.assertIsNot(x, y) > + ? ? ? ?self.assertIsNot(x["foo"], y["foo"]) > > ? ? def test_deepcopy_reflexive_dict(self): > ? ? ? ? x = {} > @@ -315,8 +337,8 @@ > ? ? ? ? ? ? self.assertRaises(TypeError, op, y, x) > ? ? ? ? for op in equality_comparisons: > ? ? ? ? ? ? self.assertRaises(RuntimeError, op, y, x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y['foo'] is y) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIs(y['foo'], y) > ? ? ? ? self.assertEqual(len(y), 1) > > ? ? def test_deepcopy_keepalive(self): > @@ -349,7 +371,7 @@ > ? ? ? ? x = C([42]) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(y.foo is not x.foo) > + ? ? ? ?self.assertIsNot(y.foo, x.foo) > > ? ? def test_deepcopy_inst_deepcopy(self): > ? ? ? ? class C: > @@ -362,8 +384,8 @@ > ? ? ? ? x = C([42]) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y.foo is not x.foo) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIsNot(y.foo, x.foo) > > ? ? def test_deepcopy_inst_getinitargs(self): > ? ? ? ? class C: > @@ -376,8 +398,8 @@ > ? ? ? ? x = C([42]) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y.foo is not x.foo) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIsNot(y.foo, x.foo) > > ? ? def test_deepcopy_inst_getstate(self): > ? ? ? ? class C: > @@ -390,8 +412,8 @@ > ? ? ? ? x = C([42]) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y.foo is not x.foo) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIsNot(y.foo, x.foo) > > ? ? def test_deepcopy_inst_setstate(self): > ? ? ? ? class C: > @@ -404,8 +426,8 @@ > ? ? ? ? x = C([42]) > ? ? ? ? 
y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y.foo is not x.foo) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIsNot(y.foo, x.foo) > > ? ? def test_deepcopy_inst_getstate_setstate(self): > ? ? ? ? class C: > @@ -420,8 +442,8 @@ > ? ? ? ? x = C([42]) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y.foo is not x.foo) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIsNot(y.foo, x.foo) > > ? ? def test_deepcopy_reflexive_inst(self): > ? ? ? ? class C: > @@ -429,8 +451,8 @@ > ? ? ? ? x = C() > ? ? ? ? x.foo = x > ? ? ? ? y = copy.deepcopy(x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y.foo is y) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIs(y.foo, y) > > ? ? # _reconstruct() > > @@ -440,9 +462,9 @@ > ? ? ? ? ? ? ? ? return "" > ? ? ? ? x = C() > ? ? ? ? y = copy.copy(x) > - ? ? ? ?self.assertTrue(y is x) > + ? ? ? ?self.assertIs(y, x) > ? ? ? ? y = copy.deepcopy(x) > - ? ? ? ?self.assertTrue(y is x) > + ? ? ? ?self.assertIs(y, x) > > ? ? def test_reconstruct_nostate(self): > ? ? ? ? class C(object): > @@ -451,9 +473,9 @@ > ? ? ? ? x = C() > ? ? ? ? x.foo = 42 > ? ? ? ? y = copy.copy(x) > - ? ? ? ?self.assertTrue(y.__class__ is x.__class__) > + ? ? ? ?self.assertIs(y.__class__, x.__class__) > ? ? ? ? y = copy.deepcopy(x) > - ? ? ? ?self.assertTrue(y.__class__ is x.__class__) > + ? ? ? ?self.assertIs(y.__class__, x.__class__) > > ? ? def test_reconstruct_state(self): > ? ? ? ? class C(object): > @@ -467,7 +489,7 @@ > ? ? ? ? self.assertEqual(y, x) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(y.foo is not x.foo) > + ? ? ? ?self.assertIsNot(y.foo, x.foo) > > ? ? def test_reconstruct_state_setstate(self): > ? ? ? ? class C(object): > @@ -483,7 +505,7 @@ > ? ? ? ? self.assertEqual(y, x) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(y, x) > - ? ? ? ?self.assertTrue(y.foo is not x.foo) > + ? ? ? ?self.assertIsNot(y.foo, x.foo) > > ? ? def test_reconstruct_reflexive(self): > ? ? ? ? class C(object): > @@ -491,8 +513,8 @@ > ? ? ? ? x = C() > ? ? ? ? x.foo = x > ? ? ? ? y = copy.deepcopy(x) > - ? ? ? ?self.assertTrue(y is not x) > - ? ? ? ?self.assertTrue(y.foo is y) > + ? ? ? ?self.assertIsNot(y, x) > + ? ? ? ?self.assertIs(y.foo, y) > > ? ? # Additions for Python 2.3 and pickle protocol 2 > > @@ -506,12 +528,12 @@ > ? ? ? ? x = C([[1, 2], 3]) > ? ? ? ? y = copy.copy(x) > ? ? ? ? self.assertEqual(x, y) > - ? ? ? ?self.assertTrue(x is not y) > - ? ? ? ?self.assertTrue(x[0] is y[0]) > + ? ? ? ?self.assertIsNot(x, y) > + ? ? ? ?self.assertIs(x[0], y[0]) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(x, y) > - ? ? ? ?self.assertTrue(x is not y) > - ? ? ? ?self.assertTrue(x[0] is not y[0]) > + ? ? ? ?self.assertIsNot(x, y) > + ? ? ? ?self.assertIsNot(x[0], y[0]) > > ? ? def test_reduce_5tuple(self): > ? ? ? ? class C(dict): > @@ -523,12 +545,12 @@ > ? ? ? ? x = C([("foo", [1, 2]), ("bar", 3)]) > ? ? ? ? y = copy.copy(x) > ? ? ? ? self.assertEqual(x, y) > - ? ? ? ?self.assertTrue(x is not y) > - ? ? ? ?self.assertTrue(x["foo"] is y["foo"]) > + ? ? ? ?self.assertIsNot(x, y) > + ? ? ? ?self.assertIs(x["foo"], y["foo"]) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(x, y) > - ? ? ? ?self.assertTrue(x is not y) > - ? ? ? ?self.assertTrue(x["foo"] is not y["foo"]) > + ? ? ? ?self.assertIsNot(x, y) > + ? ? ? 
?self.assertIsNot(x["foo"], y["foo"]) > > ? ? def test_copy_slots(self): > ? ? ? ? class C(object): > @@ -536,7 +558,7 @@ > ? ? ? ? x = C() > ? ? ? ? x.foo = [42] > ? ? ? ? y = copy.copy(x) > - ? ? ? ?self.assertTrue(x.foo is y.foo) > + ? ? ? ?self.assertIs(x.foo, y.foo) > > ? ? def test_deepcopy_slots(self): > ? ? ? ? class C(object): > @@ -545,7 +567,7 @@ > ? ? ? ? x.foo = [42] > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(x.foo, y.foo) > - ? ? ? ?self.assertTrue(x.foo is not y.foo) > + ? ? ? ?self.assertIsNot(x.foo, y.foo) > > ? ? def test_deepcopy_dict_subclass(self): > ? ? ? ? class C(dict): > @@ -562,7 +584,7 @@ > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(x, y) > ? ? ? ? self.assertEqual(x._keys, y._keys) > - ? ? ? ?self.assertTrue(x is not y) > + ? ? ? ?self.assertIsNot(x, y) > ? ? ? ? x['bar'] = 1 > ? ? ? ? self.assertNotEqual(x, y) > ? ? ? ? self.assertNotEqual(x._keys, y._keys) > @@ -575,8 +597,8 @@ > ? ? ? ? y = copy.copy(x) > ? ? ? ? self.assertEqual(list(x), list(y)) > ? ? ? ? self.assertEqual(x.foo, y.foo) > - ? ? ? ?self.assertTrue(x[0] is y[0]) > - ? ? ? ?self.assertTrue(x.foo is y.foo) > + ? ? ? ?self.assertIs(x[0], y[0]) > + ? ? ? ?self.assertIs(x.foo, y.foo) > > ? ? def test_deepcopy_list_subclass(self): > ? ? ? ? class C(list): > @@ -586,8 +608,8 @@ > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(list(x), list(y)) > ? ? ? ? self.assertEqual(x.foo, y.foo) > - ? ? ? ?self.assertTrue(x[0] is not y[0]) > - ? ? ? ?self.assertTrue(x.foo is not y.foo) > + ? ? ? ?self.assertIsNot(x[0], y[0]) > + ? ? ? ?self.assertIsNot(x.foo, y.foo) > > ? ? def test_copy_tuple_subclass(self): > ? ? ? ? class C(tuple): > @@ -604,8 +626,8 @@ > ? ? ? ? self.assertEqual(tuple(x), ([1, 2], 3)) > ? ? ? ? y = copy.deepcopy(x) > ? ? ? ? self.assertEqual(tuple(y), ([1, 2], 3)) > - ? ? ? ?self.assertTrue(x is not y) > - ? ? ? ?self.assertTrue(x[0] is not y[0]) > + ? ? ? ?self.assertIsNot(x, y) > + ? ? ? ?self.assertIsNot(x[0], y[0]) > > ? ? def test_getstate_exc(self): > ? ? ? ? class EvilState(object): > @@ -633,10 +655,10 @@ > ? ? ? ? obj = C() > ? ? ? ? x = weakref.ref(obj) > ? ? ? ? y = _copy(x) > - ? ? ? ?self.assertTrue(y is x) > + ? ? ? ?self.assertIs(y, x) > ? ? ? ? del obj > ? ? ? ? y = _copy(x) > - ? ? ? ?self.assertTrue(y is x) > + ? ? ? ?self.assertIs(y, x) > > ? ? def test_copy_weakref(self): > ? ? ? ? self._check_weakref(copy.copy) > @@ -652,7 +674,7 @@ > ? ? ? ? u[a] = b > ? ? ? ? u[c] = d > ? ? ? ? v = copy.copy(u) > - ? ? ? ?self.assertFalse(v is u) > + ? ? ? ?self.assertIsNot(v, u) > ? ? ? ? self.assertEqual(v, u) > ? ? ? ? self.assertEqual(v[a], b) > ? ? ? ? self.assertEqual(v[c], d) > @@ -682,8 +704,8 @@ > ? ? ? ? v = copy.deepcopy(u) > ? ? ? ? self.assertNotEqual(v, u) > ? ? ? ? self.assertEqual(len(v), 2) > - ? ? ? ?self.assertFalse(v[a] is b) > - ? ? ? ?self.assertFalse(v[c] is d) > + ? ? ? ?self.assertIsNot(v[a], b) > + ? ? ? ?self.assertIsNot(v[c], d) > ? ? ? ? self.assertEqual(v[a].i, b.i) > ? ? ? ? self.assertEqual(v[c].i, d.i) > ? ? ? ? del c > @@ -702,12 +724,12 @@ > ? ? ? ? self.assertNotEqual(v, u) > ? ? ? ? self.assertEqual(len(v), 2) > ? ? ? ? (x, y), (z, t) = sorted(v.items(), key=lambda pair: pair[0].i) > - ? ? ? ?self.assertFalse(x is a) > + ? ? ? ?self.assertIsNot(x, a) > ? ? ? ? self.assertEqual(x.i, a.i) > - ? ? ? ?self.assertTrue(y is b) > - ? ? ? ?self.assertFalse(z is c) > + ? ? ? ?self.assertIs(y, b) > + ? ? ? ?self.assertIsNot(z, c) > ? ? ? ? self.assertEqual(z.i, c.i) > - ? ? ? ?self.assertTrue(t is d) > + ? ? ? 
?self.assertIs(t, d) > ? ? ? ? del x, y, z, t > ? ? ? ? del d > ? ? ? ? self.assertEqual(len(v), 1) > @@ -720,7 +742,7 @@ > ? ? ? ? f.b = f.m > ? ? ? ? g = copy.deepcopy(f) > ? ? ? ? self.assertEqual(g.m, g.b) > - ? ? ? ?self.assertTrue(g.b.__self__ is g) > + ? ? ? ?self.assertIs(g.b.__self__, g) > ? ? ? ? g.b() > > > > -- > Repository URL: http://hg.python.org/cpython > > _______________________________________________ > Python-checkins mailing list > Python-checkins at python.org > http://mail.python.org/mailman/listinfo/python-checkins > > From sandro.tosi at gmail.com Sat Aug 6 00:03:11 2011 From: sandro.tosi at gmail.com (Sandro Tosi) Date: Sat, 6 Aug 2011 00:03:11 +0200 Subject: [Python-Dev] [Python-checkins] cpython: #11572: improvements to copy module tests along with removal of old test suite In-Reply-To: References: Message-ID: Hi Jim, On Fri, Aug 5, 2011 at 23:55, Jim Jewett wrote: > Why was the old test suite removed? > > Even if everything is covered by the test file (and that isn't clear > from this checkin), I don't see anything wrong with a quick test that > doesn't require loading the whole testing apparatus. (I would have no > objection to including a comment saying that the majority of the tests > are in the test file; I just wonder why they have to be removed > entirely.) I see these reasons mainly: - it adds nothing to the stdlib (where it was included): they are tests, so they should be in the test suite - it's unmaintained, since all the work on new tests or any other changes will happen in the test_copy.py file and not in copy.py (that's true for any other module) - and also, running the tests for a single module is just a one-liner (in this case, keeping with copy: ./python -m test test_copy), and it has the advantage of running the whole test suite for that module, not just some random code. I plan to do other changes like this in the next days/weeks, so actually thanks for the question :) since it brings this up on python-dev where others can comment. Cheers, -- Sandro Tosi (aka morph, morpheus, matrixhasu) My website: http://matrixhasu.altervista.org/ Me at Debian: http://wiki.debian.org/SandroTosi From solipsis at pitrou.net Sat Aug 6 00:07:04 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 6 Aug 2011 00:07:04 +0200 Subject: [Python-Dev] [Python-checkins] cpython: #11572: improvements to copy module tests along with removal of old test suite References: Message-ID: <20110806000704.6ab08fb7@pitrou.net> On Fri, 5 Aug 2011 17:55:33 -0400 Jim Jewett wrote: > Why was the old test suite removed? > > Even if everything is covered by the test file (and that isn't clear > from this checkin), I don't see anything wrong with a quick test that > doesn't require loading the whole testing apparatus. (I would have no > objection to including a comment saying that the majority of the tests > are in the test file; I just wonder why they have to be removed > entirely.) Nobody ever runs such tests when they are not part of the official regression test suite, which makes them barely useful. Looking at them, they don't seem very advanced and are probably covered by test_copy already. The only reason to have special code in the __main__ section of stdlib modules is when it provides some (interactive) service to the user (for example, "python -m zipfile" will give you a trivial equivalent of the zip/unzip commands). Regards Antoine.
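A minimal sketch of the kind of __main__ service block meant here (the mycat module and its behaviour are invented purely for illustration; this is not the actual zipfile code):

import sys

def cat(paths):
    # Concatenate the named files to stdout, like a tiny 'cat'.
    for path in paths:
        with open(path) as f:
            sys.stdout.write(f.read())

if __name__ == '__main__':
    # "python -m mycat file1.txt file2.txt" turns the module into a
    # small command-line service.
    cat(sys.argv[1:])

A user-facing entry point like that is worth keeping in the module itself; self-test code such as the old copy._test() is not, since only Lib/test/test_copy.py is run by the regression suite.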
From cwg at falma.de Sat Aug 6 13:55:36 2011 From: cwg at falma.de (Christoph Groth) Date: Sat, 06 Aug 2011 13:55:36 +0200 Subject: [Python-Dev] inconsistent __abstractmethods__ behavior; lack of documentation Message-ID: <87bow24uvb.fsf@falma.de> Hi, while playing with abstract base classes and looking at their implementation, I've stumbled across the following issue. With Python 3.2, the script class Foo(object): __abstractmethods__ = ['boo'] class Bar(object): pass Bar.__abstractmethods__ = ['boo'] f = Foo() b = Bar() produces the following output Traceback (most recent call last): File "/home/cwg/test2.py", line 9, in b = Bar() TypeError: Can't instantiate abstract class Bar with abstract methods buzz This seems to violate PEP 3119: it is not mentioned there that setting the __abstractmethods__ attribute already during class definition (as in "Foo") should have no effect. I think this happens because CPython uses the Py_TPFLAGS_IS_ABSTRACT flag to check whether a class is abstract. Apparently, this flag is not set when the dictionary of the class contains __abstractmethods__ already upon creation. As a second issue, the special __abstractmethods__ attribute (which is a feature of the interpreter) is not mentioned anywhere in the documentation. If these are confirmed to be bugs, I can enter them into the issue tracker. Christoph From guido at python.org Sat Aug 6 14:29:09 2011 From: guido at python.org (Guido van Rossum) Date: Sat, 6 Aug 2011 08:29:09 -0400 Subject: [Python-Dev] inconsistent __abstractmethods__ behavior; lack of documentation In-Reply-To: <87bow24uvb.fsf@falma.de> References: <87bow24uvb.fsf@falma.de> Message-ID: Christoph, Do you realize that __xxx__ names can have any semantics they darn well please? If a particular __xxx__ name (or some aspect of it) is undocumented that's not a bug (not even a doc bug), it just means "hands off". That said, there may well be a bug, but it would be in the behavior of those things that *are* documented. --Guido On Sat, Aug 6, 2011 at 7:55 AM, Christoph Groth wrote: > Hi, > > while playing with abstract base classes and looking at their > implementation, I've stumbled across the following issue. ?With Python > 3.2, the script > > class Foo(object): > ? ?__abstractmethods__ = ['boo'] > class Bar(object): > ? ?pass > Bar.__abstractmethods__ = ['boo'] > f = Foo() > b = Bar() > > produces the following output > > Traceback (most recent call last): > ?File "/home/cwg/test2.py", line 9, in > ? ?b = Bar() > TypeError: Can't instantiate abstract class Bar with abstract methods buzz > > This seems to violate PEP 3119: it is not mentioned there that setting > the __abstractmethods__ attribute already during class definition (as in > "Foo") should have no effect. > > I think this happens because CPython uses the Py_TPFLAGS_IS_ABSTRACT > flag to check whether a class is abstract. ?Apparently, this flag is not > set when the dictionary of the class contains __abstractmethods__ > already upon creation. > > As a second issue, the special __abstractmethods__ attribute (which is a > feature of the interpreter) is not mentioned anywhere in the > documentation. > > If these are confirmed to be bugs, I can enter them into the issue > tracker. 
> > Christoph > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org > -- --Guido van Rossum (python.org/~guido) From cwg at falma.de Sat Aug 6 14:54:38 2011 From: cwg at falma.de (Christoph Groth) Date: Sat, 06 Aug 2011 14:54:38 +0200 Subject: [Python-Dev] inconsistent __abstractmethods__ behavior; lack of documentation References: <87bow24uvb.fsf@falma.de> Message-ID: <8739he4s4x.fsf@falma.de> Guido, thanks for the quick reply! Of course I am aware that __xxx__ names are special. But I was assuming that the features of a python interpreter which are necessary to execute the pure python modules of the standard library are supposed to be documented. Christoph From tjreedy at udel.edu Sat Aug 6 23:58:14 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 06 Aug 2011 17:58:14 -0400 Subject: [Python-Dev] inconsistent __abstractmethods__ behavior; lack of documentation In-Reply-To: References: <87bow24uvb.fsf@falma.de> Message-ID: On 8/6/2011 8:29 AM, Guido van Rossum wrote: > Do you realize that __xxx__ names can have any semantics they darn > well please? That does not seem to be to be the issue Cristoff raised. > If a particular __xxx__ name (or some aspect of it) is > undocumented that's not a bug (not even a doc bug), it just means > "hands off". "__abstractmethods__" is used in the stdlib at least in abc.py: 95 class ABCMeta(type): ... 116 def __new__(mcls, name, bases, namespace): ... 123 for name in getattr(base, "__abstractmethods__", set()): 124 value = getattr(cls, name, None) 125 if getattr(value, "__isabstractmethod__", False): 126 abstracts.add(name) 127 cls.__abstractmethods__ = frozenset(abstracts) Since this module implements a PEP (3119) and is not marked as CPython specific, it should run correctly on all implementations. So implementors need to know what the above means. ( The doc to abc.py invites readers to read this code: **Source code:** :source:`Lib/abc.py` For both reasons, this attribute appears to be part of Python rather than being private to CPython. If so, the special name *should* be documented somewhere. If it should *never* be used anywhere else (which I suspect after seeing that it is not used in numbers.py), that could be said. "__abstractmethods__: A special attribute used within ABCmeta.__new__ that should never be used anywhere else as is has a special-case effect for this one use." The problem with intentionally completely not documenting names publicly accessible in the stdlib code or from the interactive interpreter is that the non-documentation is not documented, and so the issue of documentation will repeatedly arise. The special names section of 'data model' could have a subsection for such. "The following special names are not documented as to their meaning as users should ignore them." or some such. -- Terry Jan Reedy From guido at python.org Sun Aug 7 14:45:41 2011 From: guido at python.org (Guido van Rossum) Date: Sun, 7 Aug 2011 08:45:41 -0400 Subject: [Python-Dev] inconsistent __abstractmethods__ behavior; lack of documentation In-Reply-To: References: <87bow24uvb.fsf@falma.de> Message-ID: On Sat, Aug 6, 2011 at 5:58 PM, Terry Reedy wrote: > On 8/6/2011 8:29 AM, Guido van Rossum wrote: > >> Do you realize that __xxx__ names can have any semantics they darn >> well please? > > That does not seem to be to be the issue Cristoff raised. 
I apologize, I was too fast on this one. My only excuse is that Christoph didn't indicate he was trying to figure out what another Python implementation should do -- only that he was "playing with ABCs and looking at their implementation". Looking around more I agree that *for implementers of Python* there needs to be some documentation of __abstractmethods__; alternatively, another Python implementation might have to provide a different implementation of abc.py. >> If a particular __xxx__ name (or some aspect of it) is >> undocumented that's not a bug (not even a doc bug), it just means >> "hands off". > > "__abstractmethods__" is used in the stdlib at least in abc.py: > > 95 class ABCMeta(type): > ... > 116 def __new__(mcls, name, bases, namespace): > ... > 123 for name in getattr(base, "__abstractmethods__", set()): > 124 value = getattr(cls, name, None) > 125 if getattr(value, "__isabstractmethod__", False): > 126 abstracts.add(name) > 127 cls.__abstractmethods__ = frozenset(abstracts) > > Since this module implements a PEP (3119) and is not marked as CPython > specific, it should run correctly on all implementations. I wouldn't draw that conclusion. IMO its occurrence as a "pure-Python" module in the stdlib today says nothing about how much of it is tied to CPython or not. That can only be explained in comments, docstrings or offline documentation. (Though there may be some emerging convention that CPython-specific code in the stdlib must be marked in some way that can be detected by tools, I'm not aware that much progress has been made in this area. But admittedly I am not the expert here.) > So implementors > need to know what the above means. ( Here I agree. > The doc to abc.py invites readers to read this code: > **Source code:** :source:`Lib/abc.py` For both reasons, this attribute appears to be part of Python rather than > being private to CPython. If so, the special name *should* be documented > somewhere. I'm happily to agree in this case, but I disagree that you could conclude all this from the evidence you have so far shown. > If it should *never* be used anywhere else (which I suspect after seeing > that it is not used in numbers.py), that could be said. > > "__abstractmethods__: A special attribute used within ABCmeta.__new__ that > should never be used anywhere else as is has a special-case effect for this > one use." > > The problem with intentionally completely not documenting names publicly > accessible in the stdlib code or from the interactive interpreter is that > the non-documentation is not documented, and so the issue of documentation > will repeatedly arise. The special names section of 'data model' could have > a subsection for such. "The following special names are not documented as to > their meaning as users should ignore them." or some such. I disagree. If you see a __dunder__ name which has no documentation you should simply refrain from using it, and no harm will come to you. There is a clearly stated rule in the language reference saying this. It also documents some specific __dunder__ names that are significant for users and can be used in certain specific ways (__init__, __name__, etc.). But I see no reason for a requirement to have an exhaustive list of undocumented __dunder__ names, regardless if they are supposed to be special for CPython only, for all Python versions, for the stdlib only, or whatever. 
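To make the contrast concrete, here is a minimal sketch that uses only what PEP 3119 does document -- the abc machinery -- instead of touching the undocumented attribute directly:

from abc import ABCMeta, abstractmethod

class Foo(metaclass=ABCMeta):
    @abstractmethod
    def boo(self):
        ...

Foo()  # raises TypeError: Can't instantiate abstract class Foo with abstract methods boo

Code written this way gets the instantiation check reliably on any implementation; only abc itself needs to care how __abstractmethods__ and the corresponding interpreter flag are maintained behind the scenes.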
-- --Guido van Rossum (python.org/~guido) From ncoghlan at gmail.com Sun Aug 7 15:03:04 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 7 Aug 2011 23:03:04 +1000 Subject: [Python-Dev] inconsistent __abstractmethods__ behavior; lack of documentation In-Reply-To: References: <87bow24uvb.fsf@falma.de> Message-ID: On Sun, Aug 7, 2011 at 10:45 PM, Guido van Rossum wrote: > On Sat, Aug 6, 2011 at 5:58 PM, Terry Reedy wrote: >> The problem with intentionally completely not documenting names publicly >> accessible in the stdlib code or from the interactive interpreter is that >> the non-documentation is not documented, and so the issue of documentation >> will repeatedly arise. The special names section of 'data model' could have >> a subsection for such. "The following special names are not documented as to >> their meaning as users should ignore them." or some such. > > I disagree. If you see a __dunder__ name which has no documentation > you should simply refrain from using it, and no harm will come to you. > There is a clearly stated rule in the language reference saying this. > It also documents some specific __dunder__ names that are significant > for users and can be used in certain specific ways (__init__, > __name__, etc.). But I see no reason for a requirement to have an > exhaustive list of undocumented __dunder__ names, regardless if they > are supposed to be special for CPython only, for all Python versions, > for the stdlib only, or whatever. Indeed, the way it tends to work out in practice (especially for pure Python code) is that the other implementations will just copy the internal details from CPython, and only if that turns out to be problematic for some reason will they suggest a clarification in the language reference to separate out the details of Python-the-language from CPython-the-implementation in a particular case. That doesn't happen very often, but when it does it generally seems to be because they want to do something more sane than what we do, but the more sane behaviour is hard for us to implement for some reason. Even more rarely such questions may expose an outright bug in the reference implementation (e.g. the operand precedence bug for sequence objects implemented in C that's on my to-do list for 3.3). Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From doug.hellmann at gmail.com Sun Aug 7 17:09:05 2011 From: doug.hellmann at gmail.com (Doug Hellmann) Date: Sun, 7 Aug 2011 11:09:05 -0400 Subject: [Python-Dev] "Meet the Team" on Python Insider Message-ID: [Renewing this request for participation since there are a few new members since the original request went out.] We are running a series of interviews with the Python developers on the python-dev blog (http://blog.python.org). There is a short list of questions below this message. If you would like to be included in the series, please reply directly to me with your answers. We will be doing one or two posts per week, depending on the number of responses and availability of other information to post. We'll have to wait until we have several responses before we start, so we don't announce a new series and then post two messages before it peters out. Please help us by sending your answers as quickly as you can so we can tell what we'll be dealing with. Posts will be published in roughly the order the responses are received, with the text exactly as you send it (unedited, except for formatting it as HTML). 
If the questions don't apply to you, you can't remember the answer, or don't have an answer, then you can either leave that question blank or interpret it more broadly and give some related information. Blank questions will be omitted from the posts. Thanks, Doug Personal information: name location (city, country, whatever you want to give--we don't need your mailing address) home page or blog url Questions: 1. How long have you been using Python? 2. How long have you been a core committer? 3. How did you get started as a core developer? Do you remember your first commit? 4. Which parts of Python are you working on now? 5. What do you do with Python when you aren't doing core development work? (day job, other projects, etc.) 6. What do you do when you aren't programming? From victor.stinner at haypocalc.com Mon Aug 8 22:26:04 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Mon, 08 Aug 2011 22:26:04 +0200 Subject: [Python-Dev] urllib bug in Python 3.2.1? In-Reply-To: References: <4E35DC94.2090208@mrabarnett.plus.com> Message-ID: <4E40465C.2080500@haypocalc.com> >> With Python 3.1 and Python 3.2.1 it works OK, but with Python 3.2.1 the >> read returns an empty string (I checked it myself). > > http://bugs.python.org/issue12576 The bug is now fixed. Can you release a Python 3.2.2, maybe only with this fix? Victor From tjreedy at udel.edu Tue Aug 9 01:35:22 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 08 Aug 2011 19:35:22 -0400 Subject: [Python-Dev] urllib bug in Python 3.2.1? In-Reply-To: <4E40465C.2080500@haypocalc.com> References: <4E35DC94.2090208@mrabarnett.plus.com> <4E40465C.2080500@haypocalc.com> Message-ID: On 8/8/2011 4:26 PM, Victor Stinner wrote: >>> With Python 3.1 and Python 3.2.1 it works OK, but with Python 3.2.1 the >>> read returns an empty string (I checked it myself). >> >> http://bugs.python.org/issue12576 > > The bug is now fixed. Can you release a Python 3.2.2, maybe only with > this fix? Any new release should also have http://bugs.python.org/issue12540 which fixes another bad regression. -- Terry Jan Reedy From doug.hellmann at gmail.com Tue Aug 9 03:45:36 2011 From: doug.hellmann at gmail.com (Doug Hellmann) Date: Mon, 8 Aug 2011 21:45:36 -0400 Subject: [Python-Dev] "Meet the Team" on Python Insider In-Reply-To: References: Message-ID: <9D783268-7A06-4366-8207-AD43935FC1A1@gmail.com> I should have made clear that if you have already completed the survey, we still have your data in the queue. The invitation is for anyone who has not yet sent us the info, including new team members. Doug On Aug 7, 2011, at 11:09 AM, Doug Hellmann wrote: > [Renewing this request for participation since there are a few new members since the original request went out.] > > We are running a series of interviews with the Python developers on the python-dev blog (http://blog.python.org). There is a short list of questions below this message. If you would like to be included in the series, please reply directly to me with your answers. > > We will be doing one or two posts per week, depending on the number of responses and availability of other information to post. We'll have to wait until we have several responses before we start, so we don't announce a new series and then post two messages before it peters out. Please help us by sending your answers as quickly as you can so we can tell what we'll be dealing with. 
> > Posts will be published in roughly the order the responses are received, with the text exactly as you send it (unedited, except for formatting it as HTML). If the questions don't apply to you, you can't remember the answer, or don't have an answer, then you can either leave that question blank or interpret it more broadly and give some related information. Blank questions will be omitted from the posts. > > Thanks, > Doug > > > > Personal information: > > name > location (city, country, whatever you want to give--we don't need your mailing address) > home page or blog url > > Questions: > > 1. How long have you been using Python? > > 2. How long have you been a core committer? > > 3. How did you get started as a core developer? Do you remember your first commit? > > 4. Which parts of Python are you working on now? > > 5. What do you do with Python when you aren't doing core development work? (day job, other projects, etc.) > > 6. What do you do when you aren't programming? > From g.brandl at gmx.net Tue Aug 9 08:02:45 2011 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 09 Aug 2011 08:02:45 +0200 Subject: [Python-Dev] urllib bug in Python 3.2.1? In-Reply-To: References: <4E35DC94.2090208@mrabarnett.plus.com> <4E40465C.2080500@haypocalc.com> Message-ID: Am 09.08.2011 01:35, schrieb Terry Reedy: > On 8/8/2011 4:26 PM, Victor Stinner wrote: >>>> With Python 3.1 and Python 3.2.1 it works OK, but with Python 3.2.1 the >>>> read returns an empty string (I checked it myself). >>> >>> http://bugs.python.org/issue12576 >> >> The bug is now fixed. Can you release a Python 3.2.2, maybe only with >> this fix? > > Any new release should also have > http://bugs.python.org/issue12540 > which fixes another bad regression. I can certainly release a version with these two fixes. Question is, should we call it 3.2.2, or 3.2.1.1 (3.2.1p1)? Georg From tjreedy at udel.edu Tue Aug 9 09:08:24 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 09 Aug 2011 03:08:24 -0400 Subject: [Python-Dev] urllib bug in Python 3.2.1? In-Reply-To: References: <4E35DC94.2090208@mrabarnett.plus.com> <4E40465C.2080500@haypocalc.com> Message-ID: On 8/9/2011 2:02 AM, Georg Brandl wrote: > Am 09.08.2011 01:35, schrieb Terry Reedy: >> On 8/8/2011 4:26 PM, Victor Stinner wrote: >>>>> With Python 3.1 and Python 3.2.1 it works OK, but with Python 3.2.1 the >>>>> read returns an empty string (I checked it myself). >>>> >>>> http://bugs.python.org/issue12576 >>> >>> The bug is now fixed. Can you release a Python 3.2.2, maybe only with >>> this fix? >> >> Any new release should also have >> http://bugs.python.org/issue12540 >> which fixes another bad regression. > > I can certainly release a version with these two fixes. Question is, should > we call it 3.2.2, or 3.2.1.1 (3.2.1p1)? I believe precedent and practicality say 3.2.2. How much more, if anything is up to you. The important question is whether Martin is willing to do a Windows installer, as 12540 only affects Windows. -- Terry Jan Reedy From socketpair at gmail.com Tue Aug 9 11:31:47 2011 From: socketpair at gmail.com (=?UTF-8?B?0JzQsNGA0Log0JrQvtGA0LXQvdCx0LXRgNCz?=) Date: Tue, 9 Aug 2011 15:31:47 +0600 Subject: [Python-Dev] GIL removal question Message-ID: Probably I want to re-invent a bicycle. I want developers to say me why we can not remove GIL in that way: 1. Remove GIL completely with all current logick. 2. Add it's own RW-locking to all mutable objects (like list or dict) 3. Add RW-locks to every context instance 4. 
use RW-locks when accessing members of object instances Only one reason, I see, not do that -- is performance of singlethreaded applications. Why not to fix locking functions for this 4 cases to stubs when only one thread present? For atomicity, locks may be implemented as this: For example for this source: -------------------------------- import threading def x(): i = 1000 while i: i -= 1 a = threading.Thread(target=x) b = threading.Thread(target=x) a.start() b.start() a.join() b.join() -------------------------------- in my case it will be fully parallel, as common object is not locked much (only global context when a.xxxx = yyyy executed). I think, performance of such code will be higher than using GIL. Other significant reason of not using my case, as I think, is a plenty of atomic processor instructions in each thread, which affect kernel performance. Also, I know about incompatibility my variant with existing code. In a summary: Please say clearly why, actually, my variant is not still implemented. Thanks. -- Segmentation fault From arfrever.fta at gmail.com Tue Aug 9 15:53:17 2011 From: arfrever.fta at gmail.com (Arfrever Frehtes Taifersar Arahesis) Date: Tue, 9 Aug 2011 15:53:17 +0200 Subject: [Python-Dev] urllib bug in Python 3.2.1? In-Reply-To: References: <4E35DC94.2090208@mrabarnett.plus.com> Message-ID: <201108091553.18632.Arfrever.FTA@gmail.com> 2011-08-09 08:02:45 Georg Brandl napisał(a): > Am 09.08.2011 01:35, schrieb Terry Reedy: > > On 8/8/2011 4:26 PM, Victor Stinner wrote: > >>>> With Python 3.1 and Python 3.2.1 it works OK, but with Python 3.2.1 the > >>>> read returns an empty string (I checked it myself). > >>> > >>> http://bugs.python.org/issue12576 > >> > >> The bug is now fixed. Can you release a Python 3.2.2, maybe only with > >> this fix? > > > > Any new release should also have > > http://bugs.python.org/issue12540 > > which fixes another bad regression. > > I can certainly release a version with these two fixes.
Question is, should > we call it 3.2.2, or 3.2.1.1 (3.2.1p1)? I would suggest that a normal release with all changes committed on 3.2 branch be created. -- Arfrever Frehtes Taifersar Arahesis -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part. URL: From stefan_ml at behnel.de Tue Aug 9 16:11:07 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 09 Aug 2011 16:11:07 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: References: Message-ID: ???? ?????????, 09.08.2011 11:31: > In a summary: Please say clearly why, actually, my variant is not > still implemented. This question comes up on the different Python lists every once in a while. In general, if you want something to be implemented in a specific way, feel free to provide the implementation. There were several attempts to remove the GIL from the interpreter, you can look them up in the archives of this mailing list. They all failed to provide competitive performance, especially for the single-threaded case, and were therefore deemed inappropriate "solutions" to the "problem". Note that I put "problem" into quotes, simply because it is controversial if the GIL actually *is* a problem. This question has also been discussed and rediscussed in great length on the different Python lists. Stefan From barry at python.org Tue Aug 9 16:25:29 2011 From: barry at python.org (Barry Warsaw) Date: Tue, 9 Aug 2011 10:25:29 -0400 Subject: [Python-Dev] urllib bug in Python 3.2.1? In-Reply-To: References: <4E35DC94.2090208@mrabarnett.plus.com> <4E40465C.2080500@haypocalc.com> Message-ID: <20110809102529.0f60dd93@resist.wooz.org> On Aug 09, 2011, at 08:02 AM, Georg Brandl wrote: >I can certainly release a version with these two fixes. Question is, should >we call it 3.2.2, or 3.2.1.1 (3.2.1p1)? Definitely 3.2.2. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From g.brandl at gmx.net Tue Aug 9 20:36:50 2011 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 09 Aug 2011 20:36:50 +0200 Subject: [Python-Dev] urllib bug in Python 3.2.1? In-Reply-To: <20110809102529.0f60dd93@resist.wooz.org> References: <4E35DC94.2090208@mrabarnett.plus.com> <4E40465C.2080500@haypocalc.com> <20110809102529.0f60dd93@resist.wooz.org> Message-ID: Am 09.08.2011 16:25, schrieb Barry Warsaw: > On Aug 09, 2011, at 08:02 AM, Georg Brandl wrote: > >>I can certainly release a version with these two fixes. Question is, should >>we call it 3.2.2, or 3.2.1.1 (3.2.1p1)? > > Definitely 3.2.2. OK, 3.2.2 it is. I will have to have a closer look at the other changes in the branch to decide if it'll be a single(double)-fix only release. Schedule would be roughly as follows: rc1 this Friday/Saturday, then I'm on vacation for a little more than one week, so final would be the weekend of 27/28 August. Georg From dave at dabeaz.com Wed Aug 10 13:09:07 2011 From: dave at dabeaz.com (David Beazley) Date: Wed, 10 Aug 2011 06:09:07 -0500 Subject: [Python-Dev] GIL removal question In-Reply-To: References: Message-ID: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> > > Message: 1 > Date: Tue, 9 Aug 2011 15:31:47 +0600 > From: ???? ????????? > To: python-dev at python.org > Subject: [Python-Dev] GIL removal question > Message-ID: > > Content-Type: text/plain; charset=UTF-8 > > Probably I want to re-invent a bicycle. 
I want developers to say me > why we can not remove GIL in that way: > > 1. Remove GIL completely with all current logick. > 2. Add it's own RW-locking to all mutable objects (like list or dict) > 3. Add RW-locks to every context instance > 4. use RW-locks when accessing members of object instances You're forgetting step 5. 5. Put fine-grain locks around all reference counting operations (or rewrite all of Python's memory management and garbage collection from scratch). > Only one reason, I see, not do that -- is performance of > singlethreaded applications. After implementing the aforementioned step 5, you will find that the performance of everything, including the threaded code, will be quite a bit worse. Frankly, this is probably the most significant obstacle to have any kind of GIL-less Python with reasonable performance. Just as an aside, I recently did some experiments with the fabled patch to remove the GIL from Python 1.4 (mainly for my own historical curiosity). On Linux, the performance isn't just slightly worse, it makes single-threaded code run about 6-7 times slower and threaded code runs even worse. So, basically everything runs like a dog. No GIL though. Cheers, Dave From ncoghlan at gmail.com Wed Aug 10 13:15:37 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 10 Aug 2011 21:15:37 +1000 Subject: [Python-Dev] GIL removal question In-Reply-To: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> Message-ID: On Wed, Aug 10, 2011 at 9:09 PM, David Beazley wrote: > You're forgetting step 5. > > 5. Put fine-grain locks around all reference counting operations (or rewrite all of Python's memory management and garbage collection from scratch). ... > After implementing the aforementioned step 5, you will find that the performance of everything, including the threaded code, will be quite a bit worse. ?Frankly, this is probably the most significant obstacle to have any kind of GIL-less Python with reasonable performance. PyPy would actually make a significantly better basis for this kind of experimentation, since they *don't* use reference counting for their memory management. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From dave at dabeaz.com Wed Aug 10 13:32:27 2011 From: dave at dabeaz.com (David Beazley) Date: Wed, 10 Aug 2011 06:32:27 -0500 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> Message-ID: <9E289F07-B0DA-47E6-B46C-22FAE17D4A0D@dabeaz.com> On Aug 10, 2011, at 6:15 AM, Nick Coghlan wrote: > On Wed, Aug 10, 2011 at 9:09 PM, David Beazley wrote: >> You're forgetting step 5. >> >> 5. Put fine-grain locks around all reference counting operations (or rewrite all of Python's memory management and garbage collection from scratch). > ... >> After implementing the aforementioned step 5, you will find that the performance of everything, including the threaded code, will be quite a bit worse. Frankly, this is probably the most significant obstacle to have any kind of GIL-less Python with reasonable performance. > > PyPy would actually make a significantly better basis for this kind of > experimentation, since they *don't* use reference counting for their > memory management. > That's an experiment that would pretty interesting. I think the real question would boil down to what *else* do they have to lock to make everything work. 
Reference counting is a huge bottleneck for CPython to be sure, but it's definitely not the only issue that has to be addressed in making a free-threaded Python. Cheers, Dave From ncoghlan at gmail.com Wed Aug 10 13:42:22 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 10 Aug 2011 21:42:22 +1000 Subject: [Python-Dev] GIL removal question In-Reply-To: <9E289F07-B0DA-47E6-B46C-22FAE17D4A0D@dabeaz.com> References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> <9E289F07-B0DA-47E6-B46C-22FAE17D4A0D@dabeaz.com> Message-ID: On Wed, Aug 10, 2011 at 9:32 PM, David Beazley wrote: > On Aug 10, 2011, at 6:15 AM, Nick Coghlan wrote: >> PyPy would actually make a significantly better basis for this kind of >> experimentation, since they *don't* use reference counting for their >> memory management. > > That's an experiment that would pretty interesting. ?I think the real question would boil down to what *else* do they have to lock to make everything work. ? Reference counting is a huge bottleneck for CPython to be sure, but it's definitely not the only issue that has to be addressed in making a free-threaded Python. > Yeah, the problem reduces back to the 4 steps in the original post. Still not trivial, since there's quite a bit of internal interpreter state to protect, but significantly more feasible than dealing with CPython's reference counting. However, you do get additional complexities like the JIT compiler coming into play, so it is really a question that would need to be raised directly with the PyPy dev team. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From guido at python.org Wed Aug 10 13:43:09 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 10 Aug 2011 07:43:09 -0400 Subject: [Python-Dev] GIL removal question In-Reply-To: <9E289F07-B0DA-47E6-B46C-22FAE17D4A0D@dabeaz.com> References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> <9E289F07-B0DA-47E6-B46C-22FAE17D4A0D@dabeaz.com> Message-ID: On Wed, Aug 10, 2011 at 7:32 AM, David Beazley wrote: > > On Aug 10, 2011, at 6:15 AM, Nick Coghlan wrote: > >> On Wed, Aug 10, 2011 at 9:09 PM, David Beazley wrote: >>> You're forgetting step 5. >>> >>> 5. Put fine-grain locks around all reference counting operations (or rewrite all of Python's memory management and garbage collection from scratch). >> ... >>> After implementing the aforementioned step 5, you will find that the performance of everything, including the threaded code, will be quite a bit worse. ?Frankly, this is probably the most significant obstacle to have any kind of GIL-less Python with reasonable performance. >> >> PyPy would actually make a significantly better basis for this kind of >> experimentation, since they *don't* use reference counting for their >> memory management. >> > > That's an experiment that would pretty interesting. ?I think the real question would boil down to what *else* do they have to lock to make everything work. ? Reference counting is a huge bottleneck for CPython to be sure, but it's definitely not the only issue that has to be addressed in making a free-threaded Python. They have a specific plan, based on Software Transactional Memory: http://morepypy.blogspot.com/2011/06/global-interpreter-lock-or-how-to-kill.html Personally, I'm not holding my breath, because STM in other areas has so far captured many imaginations without bringing practical results (I keep hearing about it as this promising theory that needs more work to implement, sort-of like String Theory in theoretical physics). 
But I'm also not denying that Armin Rigo has a brain the size of the planet, and PyPy *has* already made much real, practical progress. -- --Guido van Rossum (python.org/~guido) From fijall at gmail.com Wed Aug 10 17:20:28 2011 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 10 Aug 2011 17:20:28 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> <9E289F07-B0DA-47E6-B46C-22FAE17D4A0D@dabeaz.com> Message-ID: On Wed, Aug 10, 2011 at 1:43 PM, Guido van Rossum wrote: > On Wed, Aug 10, 2011 at 7:32 AM, David Beazley wrote: >> >> On Aug 10, 2011, at 6:15 AM, Nick Coghlan wrote: >> >>> On Wed, Aug 10, 2011 at 9:09 PM, David Beazley wrote: >>>> You're forgetting step 5. >>>> >>>> 5. Put fine-grain locks around all reference counting operations (or rewrite all of Python's memory management and garbage collection from scratch). >>> ... >>>> After implementing the aforementioned step 5, you will find that the performance of everything, including the threaded code, will be quite a bit worse. ?Frankly, this is probably the most significant obstacle to have any kind of GIL-less Python with reasonable performance. >>> >>> PyPy would actually make a significantly better basis for this kind of >>> experimentation, since they *don't* use reference counting for their >>> memory management. >>> >> >> That's an experiment that would pretty interesting. ?I think the real question would boil down to what *else* do they have to lock to make everything work. ? Reference counting is a huge bottleneck for CPython to be sure, but it's definitely not the only issue that has to be addressed in making a free-threaded Python. > > They have a specific plan, based on Software Transactional Memory: > http://morepypy.blogspot.com/2011/06/global-interpreter-lock-or-how-to-kill.html > > Personally, I'm not holding my breath, because STM in other areas has > so far captured many imaginations without bringing practical results > (I keep hearing about it as this promising theory that needs more work > to implement, sort-of like String Theory in theoretical physics). Note that the PyPy's plan does *not* assume the end result will be comparable in the single-threaded case. The goal is to be able to compile two *different* pypy's, one fast single-threaded, one gil-less, but with a significant overhead. The trick is to get this working in a way that does not increase maintenance burden. It's also research, so among other things it might not work. Cheers, fijal From riscutiavlad at gmail.com Wed Aug 10 18:14:49 2011 From: riscutiavlad at gmail.com (Vlad Riscutia) Date: Wed, 10 Aug 2011 09:14:49 -0700 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> <9E289F07-B0DA-47E6-B46C-22FAE17D4A0D@dabeaz.com> Message-ID: Removing GIL is interesting work and probably multiple people are willing to contribute. Threading and synchronization is a deep topic and it might be that if just one person toys around with removing GIL he might not see performance improvement (not meaning to offend anyone who tried this, honestly) but what about forking a branch for this work, with some good benchmarks in place and have community contribute? Let's say first step would be just replacing GIL with some fine grained locks with expected performance degradation but afterwards we can try to incrementally improve on this. 
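Purely as an illustration of what the per-object locking in the original proposal would mean, here is a rough pure-Python sketch (the real work would of course have to happen in C, around every object and every reference-count update):

import threading

class RWLock:
    # Readers-writer lock: many concurrent readers, one exclusive writer.
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if not self._readers:
                self._cond.notify_all()

    def acquire_write(self):
        # Hold the underlying lock to keep new readers out, then wait
        # for the current readers to drain.
        self._cond.acquire()
        while self._readers:
            self._cond.wait()

    def release_write(self):
        self._cond.release()

class LockedDict:
    # What "add RW-locking to all mutable objects (like list or dict)"
    # would look like for a single dict.
    def __init__(self):
        self._lock = RWLock()
        self._data = {}

    def __getitem__(self, key):
        self._lock.acquire_read()
        try:
            return self._data[key]
        finally:
            self._lock.release_read()

    def __setitem__(self, key, value):
        self._lock.acquire_write()
        try:
            self._data[key] = value
        finally:
            self._lock.release_write()

Even this toy version makes every read pay for a lock round-trip, which is exactly the single-threaded overhead (and, at the C level, the per-refcount cost) pointed out earlier in the thread.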
Thank you, Vlad On Wed, Aug 10, 2011 at 8:20 AM, Maciej Fijalkowski wrote: > On Wed, Aug 10, 2011 at 1:43 PM, Guido van Rossum > wrote: > > On Wed, Aug 10, 2011 at 7:32 AM, David Beazley wrote: > >> > >> On Aug 10, 2011, at 6:15 AM, Nick Coghlan wrote: > >> > >>> On Wed, Aug 10, 2011 at 9:09 PM, David Beazley > wrote: > >>>> You're forgetting step 5. > >>>> > >>>> 5. Put fine-grain locks around all reference counting operations (or > rewrite all of Python's memory management and garbage collection from > scratch). > >>> ... > >>>> After implementing the aforementioned step 5, you will find that the > performance of everything, including the threaded code, will be quite a bit > worse. Frankly, this is probably the most significant obstacle to have any > kind of GIL-less Python with reasonable performance. > >>> > >>> PyPy would actually make a significantly better basis for this kind of > >>> experimentation, since they *don't* use reference counting for their > >>> memory management. > >>> > >> > >> That's an experiment that would pretty interesting. I think the real > question would boil down to what *else* do they have to lock to make > everything work. Reference counting is a huge bottleneck for CPython to be > sure, but it's definitely not the only issue that has to be addressed in > making a free-threaded Python. > > > > They have a specific plan, based on Software Transactional Memory: > > > http://morepypy.blogspot.com/2011/06/global-interpreter-lock-or-how-to-kill.html > > > > Personally, I'm not holding my breath, because STM in other areas has > > so far captured many imaginations without bringing practical results > > (I keep hearing about it as this promising theory that needs more work > > to implement, sort-of like String Theory in theoretical physics). > > Note that the PyPy's plan does *not* assume the end result will be > comparable in the single-threaded case. The goal is to be able to > compile two *different* pypy's, one fast single-threaded, one > gil-less, but with a significant overhead. The trick is to get this > working in a way that does not increase maintenance burden. It's also > research, so among other things it might not work. > > Cheers, > fijal > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/riscutiavlad%40gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brian.curtin at gmail.com Wed Aug 10 18:19:12 2011 From: brian.curtin at gmail.com (Brian Curtin) Date: Wed, 10 Aug 2011 11:19:12 -0500 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> <9E289F07-B0DA-47E6-B46C-22FAE17D4A0D@dabeaz.com> Message-ID: On Wed, Aug 10, 2011 at 11:14, Vlad Riscutia wrote: > Removing GIL is interesting work and probably multiple people are willing > to contribute. Threading and synchronization is a deep topic and it might be > that if just one person toys around with removing GIL he might not see > performance improvement (not meaning to offend anyone who tried this, > honestly) but what about forking a branch for this work, with some good > benchmarks in place and have community contribute? Let's say first step > would be just replacing GIL with some fine grained locks with expected > performance degradation but afterwards we can try to incrementally improve > on this. 
> > Thank you, > Vlad > Feel free to start this: http://hg.python.org/cpython -------------- next part -------------- An HTML attachment was scrubbed... URL: From ericsnowcurrently at gmail.com Wed Aug 10 19:04:05 2011 From: ericsnowcurrently at gmail.com (Eric Snow) Date: Wed, 10 Aug 2011 11:04:05 -0600 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> <9E289F07-B0DA-47E6-B46C-22FAE17D4A0D@dabeaz.com> Message-ID: On Wed, Aug 10, 2011 at 10:19 AM, Brian Curtin wrote: > On Wed, Aug 10, 2011 at 11:14, Vlad Riscutia wrote: >> >> Removing GIL is interesting work and probably multiple people are willing >> to contribute.?Threading and synchronization is a deep topic and it might be >> that if just one person toys around with removing GIL he might not see >> performance improvement (not meaning to offend anyone who tried this, >> honestly) but what about forking a branch for this work, with some good >> benchmarks in place and have community contribute? Let's say first step >> would be just replacing GIL with some fine grained locks with expected >> performance degradation but afterwards we can try to incrementally improve >> on this. >> Thank you, >> Vlad > > Feel free to start this:?http://hg.python.org/cpython +1 on not waiting for someone else to do it if you have an idea. :) Bitbucket makes it really easy for anyone to fork a repo into a new project and they keep an up to date mirror of the CPython repo: https://bitbucket.org/mirror/cpython/overview -eric > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/ericsnowcurrently%40gmail.com > > From raymond.hettinger at gmail.com Wed Aug 10 19:19:06 2011 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Wed, 10 Aug 2011 10:19:06 -0700 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> Message-ID: <8A2B2AFE-8AD4-4379-A2E9-64987CA52B68@gmail.com> On Aug 10, 2011, at 4:15 AM, Nick Coghlan wrote: >> After implementing the aforementioned step 5, you will find that the performance of everything, including the threaded code, will be quite a bit worse. Frankly, this is probably the most significant obstacle to have any kind of GIL-less Python with reasonable performance. > > PyPy would actually make a significantly better basis for this kind of > experimentation, since they *don't* use reference counting for their > memory management. Jython may be a better choice. It is all about concurrency. Its dicts are built on top of Java's ConcurrentHashMap for example. Raymond -------------- next part -------------- An HTML attachment was scrubbed... URL: From brian.curtin at gmail.com Wed Aug 10 20:55:56 2011 From: brian.curtin at gmail.com (Brian Curtin) Date: Wed, 10 Aug 2011 13:55:56 -0500 Subject: [Python-Dev] Moving forward with the concurrent package Message-ID: Now that we have concurrent.futures, is there any plan for multiprocessing to follow suit? PEP 3148 mentions a hope to add or move things in the future [0], which would be now. [0] http://www.python.org/dev/peps/pep-3148/#naming -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benjamin at python.org Wed Aug 10 21:54:33 2011 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 10 Aug 2011 14:54:33 -0500 Subject: [Python-Dev] Moving forward with the concurrent package In-Reply-To: References: Message-ID: 2011/8/10 Brian Curtin : > Now that we have concurrent.futures, is there any plan for multiprocessing > to follow suit? PEP 3148 mentions a hope to add or move things in the future Is there some sort of concrete proposal? The PEP just seems to mention it as an idea. In general, -1. I think we don't need to be moving things around more to little advantage. -- Regards, Benjamin From fijall at gmail.com Wed Aug 10 22:06:58 2011 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 10 Aug 2011 22:06:58 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: <8A2B2AFE-8AD4-4379-A2E9-64987CA52B68@gmail.com> References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> <8A2B2AFE-8AD4-4379-A2E9-64987CA52B68@gmail.com> Message-ID: On Wed, Aug 10, 2011 at 7:19 PM, Raymond Hettinger wrote: > > On Aug 10, 2011, at 4:15 AM, Nick Coghlan wrote: > > After implementing the aforementioned step 5, you will find that the > performance of everything, including the threaded code, will be quite a bit > worse. ?Frankly, this is probably the most significant obstacle to have any > kind of GIL-less Python with reasonable performance. > > PyPy would actually make a significantly better basis for this kind of > experimentation, since they *don't* use reference counting for their > memory management. > > Jython may be a better choice. ?It is all about concurrency. ?Its dicts are > built on top of?Java's ConcurrentHashMap for example. > Jython is kind of boring choice because it does not have a GIL at all (same as IronPython). It might *work* for what you're trying to achieve but GIL-removal is not really that interesting. From solipsis at pitrou.net Wed Aug 10 22:36:43 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 10 Aug 2011 22:36:43 +0200 Subject: [Python-Dev] Moving forward with the concurrent package References: Message-ID: <20110810223643.5fadea2d@msiwind> Le Wed, 10 Aug 2011 14:54:33 -0500, Benjamin Peterson a ?crit : > 2011/8/10 Brian Curtin : > > Now that we have concurrent.futures, is there any plan for > > multiprocessing to follow suit? PEP 3148 mentions a hope to add or move > > things in the future > > Is there some sort of concrete proposal? The PEP just seems to mention > it as an idea. > > In general, -1. I think we don't need to be moving things around more > to little advantage. Agreed. Also, flat is better than nested. Whoever wants to populate the concurrent package should work on new features to be added to it, rather than plans to rename things around. Regards Antoine. From brian.curtin at gmail.com Wed Aug 10 22:45:40 2011 From: brian.curtin at gmail.com (Brian Curtin) Date: Wed, 10 Aug 2011 15:45:40 -0500 Subject: [Python-Dev] Moving forward with the concurrent package In-Reply-To: <20110810223643.5fadea2d@msiwind> References: <20110810223643.5fadea2d@msiwind> Message-ID: On Wed, Aug 10, 2011 at 15:36, Antoine Pitrou wrote: > Le Wed, 10 Aug 2011 14:54:33 -0500, > Benjamin Peterson a ?crit : > > 2011/8/10 Brian Curtin : > > > Now that we have concurrent.futures, is there any plan for > > > multiprocessing to follow suit? PEP 3148 mentions a hope to add or move > > > things in the future > > > > Is there some sort of concrete proposal? The PEP just seems to mention > > it as an idea. > > > > In general, -1. 
I think we don't need to be moving things around more > > to little advantage. > > Agreed. Also, flat is better than nested. Whoever wants to populate the > concurrent package should work on new features to be added to it, rather > than plans to rename things around. I agree with flat being better than nested and won't be pushing to move things around, but the creation of the concurrent package seemed like a place to put those things. I just found myself typing "concurrent.multiprocessing" a minute ago, so I figured I'd put it out there. -------------- next part -------------- An HTML attachment was scrubbed... URL: From raymond.hettinger at gmail.com Wed Aug 10 22:46:41 2011 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Wed, 10 Aug 2011 13:46:41 -0700 Subject: [Python-Dev] Moving forward with the concurrent package In-Reply-To: <20110810223643.5fadea2d@msiwind> References: <20110810223643.5fadea2d@msiwind> Message-ID: On Aug 10, 2011, at 1:36 PM, Antoine Pitrou wrote: > Le Wed, 10 Aug 2011 14:54:33 -0500, > Benjamin Peterson a ?crit : >> 2011/8/10 Brian Curtin : >>> Now that we have concurrent.futures, is there any plan for >>> multiprocessing to follow suit? PEP 3148 mentions a hope to add or move >>> things in the future >> >> Is there some sort of concrete proposal? The PEP just seems to mention >> it as an idea. >> >> In general, -1. I think we don't need to be moving things around more >> to little advantage. > > Agreed. Also, flat is better than nested. Whoever wants to populate the > concurrent package should work on new features to be added to it, rather > than plans to rename things around. I concur. Raymond From sandro.tosi at gmail.com Wed Aug 10 23:02:42 2011 From: sandro.tosi at gmail.com (Sandro Tosi) Date: Wed, 10 Aug 2011 23:02:42 +0200 Subject: [Python-Dev] [Python-checkins] cpython (2.7): Fix closes Issue12722 - link heapq source in the text format in the In-Reply-To: <4E42E24B.8020601@udel.edu> References: <4E42E24B.8020601@udel.edu> Message-ID: On Wed, Aug 10, 2011 at 21:55, Terry Reedy wrote: > >> >> ? ? Latest version of the `heapq Python source code >> >> -`_ >> >> +`_ > > Should links be to the hg repository instead of svn? > Is svn updated from hg? > I thought is was (mostly) historical read-only. I made the same remark to Senthil on IRC, and came out that web frontend for hg.p.o doesn't allow for a nice way to specify a branch (different than 'default'), it's something like hg.python.org/cpython//path/to/file.py which is almost always outdated :) What do we use to provide the web part of hg.p.o? maybe we can just ask the developers of this tool to provide (or advertize) a proper way to select a branch. If some can provide me some info, I can do the "ask the devs" part. Cheers, -- Sandro Tosi (aka morph, morpheus, matrixhasu) My website: http://matrixhasu.altervista.org/ Me at Debian: http://wiki.debian.org/SandroTosi From ezio.melotti at gmail.com Wed Aug 10 22:58:24 2011 From: ezio.melotti at gmail.com (Ezio Melotti) Date: Wed, 10 Aug 2011 23:58:24 +0300 Subject: [Python-Dev] [Python-checkins] cpython (2.7): Fix closes Issue12722 - link heapq source in the text format in the In-Reply-To: References: <4E42E24B.8020601@udel.edu> Message-ID: <4E42F0F0.1070203@gmail.com> On 11/08/2011 0.02, Sandro Tosi wrote: > On Wed, Aug 10, 2011 at 21:55, Terry Reedy wrote: >>> Latest version of the `heapq Python source code >>> >>> -`_ >>> >>> +`_ >> Should links be to the hg repository instead of svn? >> Is svn updated from hg? 
>> I thought is was (mostly) historical read-only. > I made the same remark to Senthil on IRC, and came out that web > frontend for hg.p.o doesn't allow for a nice way to specify a branch > (different than 'default'), it's something like > hg.python.org/cpython//path/to/file.py > which is almost always outdated :) hg.python.org/cpython/2.7/path/to/file.py should work just fine. IIRC the reason why we don't do it on 2.x is because we don't have the 'source' directive available in Sphinx and therefore we would have to update all the links manually to link to h.p.o instead of s.p.o. Best Regards, Ezio Melotti > > What do we use to provide the web part of hg.p.o? maybe we can just > ask the developers of this tool to provide (or advertize) a proper way > to select a branch. If some can provide me some info, I can do the > "ask the devs" part. > > Cheers, From jnoller at gmail.com Wed Aug 10 23:34:42 2011 From: jnoller at gmail.com (Jesse Noller) Date: Wed, 10 Aug 2011 17:34:42 -0400 Subject: [Python-Dev] Moving forward with the concurrent package In-Reply-To: References: <20110810223643.5fadea2d@msiwind> Message-ID: On Wed, Aug 10, 2011 at 4:45 PM, Brian Curtin wrote: > On Wed, Aug 10, 2011 at 15:36, Antoine Pitrou wrote: >> >> Le Wed, 10 Aug 2011 14:54:33 -0500, >> Benjamin Peterson a ?crit : >> > 2011/8/10 Brian Curtin : >> > > Now that we have concurrent.futures, is there any plan for >> > > multiprocessing to follow suit? PEP 3148 mentions a hope to add or >> > > move >> > > things in the future >> > >> > Is there some sort of concrete proposal? The PEP just seems to mention >> > it as an idea. >> > >> > In general, -1. I think we don't need to be moving things around more >> > to little advantage. >> >> Agreed. Also, flat is better than nested. Whoever wants to populate the >> concurrent package should work on new features to be added to it, rather >> than plans to rename things around. > > I agree with flat being better than nested and won't be pushing to move > things around, but the creation of the concurrent package seemed like a > place to put those things. I just found myself typing > "concurrent.multiprocessing" a minute ago, so I figured I'd put it out > there. I would like to move certain *features* of multiprocessing into that namespace - some things like map and others don't belong in the multiprocessing namespace, and should have been put into concurrent.* a long time ago. As for my plans: I had intended on making multiprocessing a closer corollary to threading, and moving the bigger features that should have been broken out into a different package (such as http://bugs.python.org/issue12708) and the managers. Those plans are obviously stalled as my time is being spent elsewhere. I disagree on the "flat is better than nested" point - multiprocessing's namespace is flat - but bloated, and many of it's features could work just as well in a threaded context (e.g, they are generally useful outside of multiprocessing alone). Regardless; currently I can't lead this, and multiprocessing-sig is silent. 
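As a concrete illustration of that overlap (a sketch added for illustration, not code from the thread; the task function and worker count are arbitrary): the executor API from PEP 3148 already lets the same map-style call run on either threads or processes just by swapping the executor class, which is the kind of backend-neutral surface being discussed for the rest of multiprocessing.

    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def square(x):
        # Arbitrary stand-in for real work.
        return x * x

    if __name__ == '__main__':
        # Only the executor class changes between the thread and process back ends.
        for executor_class in (ThreadPoolExecutor, ProcessPoolExecutor):
            with executor_class(max_workers=4) as executor:
                print(list(executor.map(square, range(10))))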
Jesse From benjamin at python.org Wed Aug 10 23:37:16 2011 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 10 Aug 2011 16:37:16 -0500 Subject: [Python-Dev] Moving forward with the concurrent package In-Reply-To: References: <20110810223643.5fadea2d@msiwind> Message-ID: 2011/8/10 Raymond Hettinger : > > On Aug 10, 2011, at 1:36 PM, Antoine Pitrou wrote: > >> Le Wed, 10 Aug 2011 14:54:33 -0500, >> Benjamin Peterson a ?crit : >>> 2011/8/10 Brian Curtin : >>>> Now that we have concurrent.futures, is there any plan for >>>> multiprocessing to follow suit? PEP 3148 mentions a hope to add or move >>>> things in the future >>> >>> Is there some sort of concrete proposal? The PEP just seems to mention >>> it as an idea. >>> >>> In general, -1. I think we don't need to be moving things around more >>> to little advantage. >> >> Agreed. Also, flat is better than nested. Whoever wants to populate the >> concurrent package should work on new features to be added to it, rather >> than plans to rename things around. > > I concur. So we could put yourself, Antoine, and me in the concurrent package. :) Sorry, Benjamin From ncoghlan at gmail.com Thu Aug 11 01:03:35 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 11 Aug 2011 09:03:35 +1000 Subject: [Python-Dev] Moving forward with the concurrent package In-Reply-To: References: Message-ID: On Thu, Aug 11, 2011 at 4:55 AM, Brian Curtin wrote: > Now that we have concurrent.futures, is there any plan for multiprocessing > to follow suit? PEP 3148 mentions a hope to add or move things in the future > [0], which would be now. As Jesse said, moving multiprocessing or threading wholesale was never part of the plan. The main motivator of that comment in PEP 3148 was the idea of creating 'concurrent.pool', which would provide a concurrent worker pool API modelled on multiprocessing.Pool that supported either threads or processes as the back end, just like the executor model in concurrent.futures. The basic approach is to look at a feature in threading or multiprocessing that is only available in one of them and ask the question: Does it make sense to allow a project to switch easily between a threading strategy and a multiprocessing strategy when using this feature? If the answer to that question is yes (as it was for concurrent.futures itself, and as I believe it to be for multiprocessing.Pool), then a feature request (and probably a PEP) proposing the definition of a common API in the concurrent namespace would be appropriate. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From jnoller at gmail.com Thu Aug 11 01:06:36 2011 From: jnoller at gmail.com (Jesse Noller) Date: Wed, 10 Aug 2011 19:06:36 -0400 Subject: [Python-Dev] Moving forward with the concurrent package In-Reply-To: References: Message-ID: On Wed, Aug 10, 2011 at 7:03 PM, Nick Coghlan wrote: > On Thu, Aug 11, 2011 at 4:55 AM, Brian Curtin wrote: >> Now that we have concurrent.futures, is there any plan for multiprocessing >> to follow suit? PEP 3148 mentions a hope to add or move things in the future >> [0], which would be now. > > As Jesse said, moving multiprocessing or threading wholesale was never > part of the plan. The main motivator of that comment in PEP 3148 was > the idea of creating 'concurrent.pool', which would provide a > concurrent worker pool API modelled on multiprocessing.Pool that > supported either threads or processes as the back end, just like the > executor model in concurrent.futures. 
> > The basic approach is to look at a feature in threading or > multiprocessing that is only available in one of them and ask the > question: Does it make sense to allow a project to switch easily > between a threading strategy and a multiprocessing strategy when using > this feature? > > If the answer to that question is yes (as it was for > concurrent.futures itself, and as I believe it to be for > multiprocessing.Pool), then a feature request (and probably a PEP) > proposing the definition of a common API in the concurrent namespace > would be appropriate. > Precisely. Thank you Nick, want a job working for PyCon? ;) From senthil at uthcode.com Thu Aug 11 02:13:49 2011 From: senthil at uthcode.com (Senthil Kumaran) Date: Thu, 11 Aug 2011 08:13:49 +0800 Subject: [Python-Dev] [Python-checkins] cpython (2.7): Fix closes Issue12722 - link heapq source in the text format in the In-Reply-To: <4E42F0F0.1070203@gmail.com> References: <4E42E24B.8020601@udel.edu> <4E42F0F0.1070203@gmail.com> Message-ID: <20110811001349.GA2146@mathmagic> On Wed, Aug 10, 2011 at 11:58:24PM +0300, Ezio Melotti wrote: > > hg.python.org/cpython/2.7/path/to/file.py should work just fine. The correct path seems to be: http://hg.python.org/cpython/file/2.7/Lib/ > > IIRC the reason why we don't do it on 2.x is because we don't have > the 'source' directive available in Sphinx and therefore we would > have to update all the links manually to link to h.p.o instead of > s.p.o. I see. Does sphinx have any such directive already? How is it supposed to behave? -- Senthil From raymond.hettinger at gmail.com Thu Aug 11 02:20:35 2011 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Wed, 10 Aug 2011 17:20:35 -0700 Subject: [Python-Dev] [Python-checkins] cpython (2.7): Fix closes Issue12722 - link heapq source in the text format in the In-Reply-To: <4E42F0F0.1070203@gmail.com> References: <4E42E24B.8020601@udel.edu> <4E42F0F0.1070203@gmail.com> Message-ID: <14107F2C-DC29-44FC-8DC5-45C6AEB2A598@gmail.com> On Aug 10, 2011, at 1:58 PM, Ezio Melotti wrote: > we would have to update all the links manually to link to h.p.o instead of s.p.o. sed is your friend. Raymond -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.brandl at gmx.net Thu Aug 11 07:26:14 2011 From: g.brandl at gmx.net (Georg Brandl) Date: Thu, 11 Aug 2011 07:26:14 +0200 Subject: [Python-Dev] cpython: News item for #12724 In-Reply-To: References: Message-ID: Am 11.08.2011 03:34, schrieb brian.curtin: > http://hg.python.org/cpython/rev/3a6782f2a4a8 > changeset: 71811:3a6782f2a4a8 > user: Brian Curtin > date: Wed Aug 10 20:32:10 2011 -0500 > summary: > News item for #12724 > > files: > Misc/NEWS | 2 ++ > 1 files changed, 2 insertions(+), 0 deletions(-) If it gets a NEWS entry, why isn't it documented? Georg From solipsis at pitrou.net Thu Aug 11 09:02:42 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 11 Aug 2011 09:02:42 +0200 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. References: Message-ID: <20110811090242.1083782f@msiwind> Le Thu, 11 Aug 2011 03:34:37 +0200, brian.curtin a ?crit : > http://hg.python.org/cpython/rev/77a65b078852 > changeset: 71809:77a65b078852 > parent: 71803:1b4fae183da3 > user: Brian Curtin > date: Wed Aug 10 20:05:21 2011 -0500 > summary: > Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. It would sound more useful to have a generic Py_RETURN() macro rather than some specific forms for each and every common object. Regards Antoine. 
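For readers less familiar with the object the macro returns: NotImplemented is the sentinel that binary and comparison methods hand back so the interpreter can try the reflected operation on the other operand, and the C macro simply abbreviates the incref-and-return of that sentinel. A minimal pure-Python illustration of the protocol (the class and values are invented for the example):

    class Meters:
        def __init__(self, value):
            self.value = value

        def __eq__(self, other):
            if not isinstance(other, Meters):
                # "I don't know how to compare with this type": Python will
                # then try other.__eq__(self) and, for ==, finally fall back
                # to an identity comparison.
                return NotImplemented
            return self.value == other.value

    print(Meters(3) == Meters(3))   # True
    print(Meters(3) == "3")         # False, after both sides return NotImplemented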
From solipsis at pitrou.net Thu Aug 11 09:07:47 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 11 Aug 2011 09:07:47 +0200 Subject: [Python-Dev] Moving forward with the concurrent package References: Message-ID: <20110811090747.10643b8d@msiwind> Le Thu, 11 Aug 2011 09:03:35 +1000, Nick Coghlan a ?crit : > On Thu, Aug 11, 2011 at 4:55 AM, Brian Curtin > wrote: > > Now that we have concurrent.futures, is there any plan for > > multiprocessing to follow suit? PEP 3148 mentions a hope to add or move > > things in the future [0], which would be now. > > As Jesse said, moving multiprocessing or threading wholesale was never > part of the plan. The main motivator of that comment in PEP 3148 was > the idea of creating 'concurrent.pool', which would provide a > concurrent worker pool API modelled on multiprocessing.Pool that > supported either threads or processes as the back end, just like the > executor model in concurrent.futures. Executors *are* pools, so I don't know what you're talking about. Besides, multiprocessing.Pool is quite bloated and therefore difficult to improve. It should be slowly phased out in favour of concurrent.futures. In general, it would be nice if people wanting to improve the concurrent primitives made actual, concrete propositions. We've had lots of hand-waving in that area for years, to no effect. Regards Antoine. From ncoghlan at gmail.com Thu Aug 11 09:56:59 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 11 Aug 2011 17:56:59 +1000 Subject: [Python-Dev] Moving forward with the concurrent package In-Reply-To: <20110811090747.10643b8d@msiwind> References: <20110811090747.10643b8d@msiwind> Message-ID: On Thu, Aug 11, 2011 at 5:07 PM, Antoine Pitrou wrote: > Le Thu, 11 Aug 2011 09:03:35 +1000, > Nick Coghlan a ?crit : >> On Thu, Aug 11, 2011 at 4:55 AM, Brian Curtin >> wrote: >> > Now that we have concurrent.futures, is there any plan for >> > multiprocessing to follow suit? PEP 3148 mentions a hope to add or move >> > things in the future [0], which would be now. >> >> As Jesse said, moving multiprocessing or threading wholesale was never >> part of the plan. The main motivator of that comment in PEP 3148 was >> the idea of creating 'concurrent.pool', which would provide a >> concurrent worker pool API modelled on multiprocessing.Pool that >> supported either threads or processes as the back end, just like the >> executor model in concurrent.futures. > > Executors *are* pools, so I don't know what you're talking about. Yes, that's the point. A developer shouldn't be forced into using a particular invocation model (i.e. futures) just to get thread or process pool functionality - the pool should be a lower layer building block that's provided separately. As you say, though, nobody has stepped up for the task of actually defining that common, lower level interface. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? 
Brisbane, Australia From yoavglazner at gmail.com Thu Aug 11 11:58:06 2011 From: yoavglazner at gmail.com (yoav glazner) Date: Thu, 11 Aug 2011 12:58:06 +0300 Subject: [Python-Dev] Moving forward with the concurrent package In-Reply-To: References: <20110811090747.10643b8d@msiwind> Message-ID: On Thu, Aug 11, 2011 at 10:56 AM, Nick Coghlan wrote: > On Thu, Aug 11, 2011 at 5:07 PM, Antoine Pitrou > wrote: > > Le Thu, 11 Aug 2011 09:03:35 +1000, > > Nick Coghlan a ?crit : > >> On Thu, Aug 11, 2011 at 4:55 AM, Brian Curtin > >> wrote: > >> > Now that we have concurrent.futures, is there any plan for > >> > multiprocessing to follow suit? PEP 3148 mentions a hope to add or > move > >> > things in the future [0], which would be now. > >> > >> As Jesse said, moving multiprocessing or threading wholesale was never > >> part of the plan. The main motivator of that comment in PEP 3148 was > >> the idea of creating 'concurrent.pool', which would provide a > >> concurrent worker pool API modelled on multiprocessing.Pool that > >> supported either threads or processes as the back end, just like the > >> executor model in concurrent.futures. > > > > Executors *are* pools, so I don't know what you're talking about. > Also the Pool from multiprocessing "works" for threads and process: from multiprocessing.pool import Pool as ProcessPool from multiprocessing.dummy import Pool as ThreadPool -------------- next part -------------- An HTML attachment was scrubbed... URL: From merwok at netwok.org Thu Aug 11 16:33:51 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Thu, 11 Aug 2011 16:33:51 +0200 Subject: [Python-Dev] [Python-checkins] cpython (2.7): Fix closes Issue12722 - link heapq source in the text format in the In-Reply-To: <4E42F0F0.1070203@gmail.com> References: <4E42E24B.8020601@udel.edu> <4E42F0F0.1070203@gmail.com> Message-ID: <4E43E84F.3050309@netwok.org> Hi, > IIRC the reason why we don't do it on 2.x is because we don't have the > 'source' directive available in Sphinx and therefore we would have to > update all the links manually to link to h.p.o instead of s.p.o. In 3.2 and higher, there is a custom source role in Doc/tools/sphinxext/pyspecific.py. For 2.7, I volunteered to change all links manually (sed being, as usual, my friend) but just lacked time. Cheers From merwok at netwok.org Thu Aug 11 16:36:00 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Thu, 11 Aug 2011 16:36:00 +0200 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: References: Message-ID: <4E43E8D0.40201@netwok.org> Hi, I?ve read the latest version of this PEP, as updated by Nick Coghlan in the Mercurial repo on July, 20th. Excuse me if I repeat old arguments, I did not reread all the threads. In summary, I don?t think the PEP is useful right now, nor that it will set a good practice for the future. > * Unix-like software distributions (including systems like Mac OS X and Minor: I call these ?operating systems?. > * The Python 2.x ``idle``, ``pydoc``, and ``python-config`` commands should > likewise be available as ``idle2``, ``pydoc2``, and ``python2-config``, > with the original commands invoking these versions by default, but possibly > invoking the Python 3.x versions instead if configured to do so by the > system administrator. This item ignores that on some OSes, defining the default Python version is not a decision made by the sysadmin. 
The example I know is Debian (and derivatives): despite what one can read on the Web, it is not a good idea to change /usr/bin/python to point to the version you want; the decision affects all scripts used by the system itself, and is thus the call of the Debian Python maintainers. (FTR, Debian developers discussed adding /usr/bin/python2 at the latest DebConf and rejected it; I don't know if the arguments raised are the same as mine, but maybe Piotr or someone else will chime in in this thread.) > This is needed as, even though the majority of distributions still alias the > ``python`` command to Python 2, some now alias it to Python 3. Some of > the former also do not provide a ``python2`` command; hence, there is > currently no way for Python 2 code (or any code that invokes the Python 2 > interpreter directly rather than via ``sys.executable``) to reliably run on > all Unix-like systems without modification, as the ``python`` command will > invoke the wrong interpreter version on some systems, and the ``python2`` > command will fail completely on others. The recommendations in this PEP > provide a very simple mechanism to restore cross-platform support, with > minimal additional work required on the part of distribution maintainers. I would like more data about this. How many OSes have moved their python executable to python2? How many people does that impact? Right now I think that there's only Arch and Gentoo, which I would call minority platforms. (I'm aware that all UNIX-like free operating systems could be considered a minority OS all together, but we're talking about UNIX-like OSes here :) Doing what the majority does is not always a good thing, but for this PEP I think that numbers can help us assess whether the trouble/benefit ratio is worth it. In my opinion, the current situation is clear: python is some python2.y, python3 is a python3.y, this is not ambiguous and will still work in ten years when we get python4. Thus, the previous decision of python-dev to use python3 indefinitely seems good to me. As a script/program author, if I use python2 in my shebangs now to support what appears to be minority platforms, I'm breaking compatibility with a huge number of systems. Therefore, I don't see how this PEP makes the situation better. If one OS wants to change the meaning of the python command, then its packaging tools should adapt shebangs, and its users should be aware that the majority of existing Python 3 scripts will break. Therefore, I'm strongly -1 on this PEP: changing the meaning of python brings much trouble for little or no benefit, and adding python2 adds another compatibility trouble. It would be interesting to have feedback from people who lived the transition to Python 2. > * The ``pythonX.X`` (e.g. ``python2.6``) commands exist on some systems, on > which they invoke specific minor versions of the Python interpreter. It > can be useful for distribution-specific packages to take advantage of these > utilities if they exist, since it will prevent code breakage if the default > minor version of a given major version is changed. However, scripts > intending to be cross-platform should not rely on the presence of these > utilities, but rather should be tested on several recent minor versions of > the target major version, compensating, if necessary, for the small > differences that exist between minor versions. This prevents the need for > sysadmins to install many very similar versions of the interpreter. Here again I would be interested in more numbers.
Pythons that people manually download and install using the provided makefile do have these pythonx.y executables, so I thought that all OSes did likewise. Moreover, I disagree about the implied assertion that the minor number hardly matters (I'm paraphrasing): Python 2.6 and 2.7 *are* different, not "very similar". I don't know very well the usages of the community, but in my experience moving from 2.x to 2.x+1, or even just checking that your code still works, is a Big Deal. I'd like this whole bullet item to be removed. > Impact on PYTHON* Environment Variables I think this section should be named PYTHONPATH, as it is the only envvar that it talks about. Another minor edit: s/folder/directory/ Regards From merwok at netwok.org Thu Aug 11 16:39:34 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Thu, 11 Aug 2011 16:39:34 +0200 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning Message-ID: <4E43E9A6.7020608@netwok.org> Hi, I've read PEP 402 and would like to offer comments. I know a bit about the import system, but not down to the nitty-gritty details of PEP 302 and __path__ computations and all this fun stuff (by which I mean, not fun at all). As such, I can't find nasty issues in dark corners, but I can offer feedback as a user. I think it's a very well-written explanation of a very useful feature: +1 from me. If it is accepted, the docs will certainly be much more concise, but the PEP as a thought process is a useful document to read. > When new users come to Python from other languages, they are often > confused by Python's packaging semantics. Minor: I would reserve "packaging" for packaging/distribution/installation/deployment matters, not Python modules. I suggest "Python package semantics". > On the negative side, however, it is non-intuitive for beginners, and > requires a more complex step to turn a module into a package. If > ``Foo`` begins its life as ``Foo.py``, then it must be moved and > renamed to ``Foo/__init__.py``. Minor: In the UNIX world, or with version control tools, moving and renaming are the same one thing (hg mv spam.py spam/__init__.py for example). Also, if you turn a module into a package, you may want to move code around, change imports, etc., so I'm not sure the renaming part is such a big step. Anyway, if the import-sig people say that users think it's a complex or costly operation, I can believe it. > (By the way, both of these additions to the import protocol (i.e. the > dynamically-added ``__path__``, and dynamically-created modules) > apply recursively to child packages, using the parent package's > ``__path__`` in place of ``sys.path`` as a basis for generating a > child ``__path__``. This means that self-contained and virtual > packages can contain each other without limitation, with the caveat > that if you put a virtual package inside a self-contained one, it's > gonna have a really short ``__path__``!) I don't understand the caveat or its implications.
We are only preventing an *initial* import > from succeeding, in order to prevent false-positive import successes > when clashing subdirectories are present on ``sys.path``. I find that limitation acceptable. After all, there is no zc project, and no zc module, just a zc namespace. I?ll just regret that it?s not possible to provide a module docstring to inform that this is a namespace package used for X and Y. > The resulting list (whether empty or not) is then stored in a > ``sys.virtual_package_paths`` dictionary, keyed by module name. This was probably said on import-sig, but here I go: yet another import artifact in the sys module! I hope we get ImportEngine in 3.3 to clean up all this. > * A new ``extend_virtual_paths(path_entry)`` function, to extend > existing, already-imported virtual packages' ``__path__`` attributes > to include any portions found in a new ``sys.path`` entry. This > function should be called by applications extending ``sys.path`` > at runtime, e.g. when adding a plugin directory or an egg to the > path. Let?s imagine my application Spam has a namespace spam.ext for plugins. To use a custom directory where plugins are stored, or a zip file with plugins (I don?t use eggs, so let me talk about zip files here), I?d have to call sys.path.append *and* pkgutil.extend_virtual_paths? > * ``ImpImporter.iter_modules()`` should be changed to also detect and > yield the names of modules found in virtual packages. Is there any value in providing an argument to get the pre-PEP behavior? Or to look at it from a different place, how can Python code know that some module is a virtual or pure virtual package, if that is even a useful thing to know? > Last, but not least, the ``imp`` module (or ``importlib``, if > appropriate) should expose the algorithm described in the `virtual > paths`_ section above, as a > ``get_virtual_path(modulename, parent_path=None)`` function, so that > creators of ``__import__`` replacements can use it. If I?m not mistaken, the rule of thumb these days is that imp is edited when it?s absolutely necessary, otherwise code goes into importlib (more easily written, read and maintained). I wonder if importlib.import_module could implement the new import semantics all by itself, so that we can benefit from this PEP in older Pythons (importlib is on PyPI). > * If you are changing a currently self-contained package into a > virtual one, it's important to note that you can no longer use its > ``__file__`` attribute to locate data files stored in a package > directory. Instead, you must search ``__path__`` or use the > ``__file__`` of a submodule adjacent to the desired files, or > of a self-contained subpackage that contains the desired files. Wouldn?t pkgutil.get_data help here? Besides, putting data files in a Python package is held very poorly by some (mostly people following the File Hierarchy Standard), and in distutils2/packaging, we (will) have a resources system that?s as convenient for users and more flexible for OS packagers. Using __file__ for more than information on the module is frowned upon for other reasons anyway (I talked about a Debian developer about this one day but forgot), so I think the limitation is okay. > * XXX what is the __file__ of a "pure virtual" package? ``None``? > Some arbitrary string? The path of the first directory with a > trailing separator? No matter what we put, *some* code is > going to break, but the last choice might allow some code to > accidentally work. Is that good or bad? 
A pure virtual package having no source file, I think it should have no __file__ at all. I don?t know if that would break more code than using an empty string for example, but it feels righter. > For those implementing PEP \302 importer objects: Minor: Here I think a link would not be a nuisance (IOW remove the backslash). Regards From brian.curtin at gmail.com Thu Aug 11 16:43:44 2011 From: brian.curtin at gmail.com (Brian Curtin) Date: Thu, 11 Aug 2011 09:43:44 -0500 Subject: [Python-Dev] cpython: News item for #12724 In-Reply-To: References: Message-ID: On Thu, Aug 11, 2011 at 00:26, Georg Brandl wrote: > Am 11.08.2011 03:34, schrieb brian.curtin: > > http://hg.python.org/cpython/rev/3a6782f2a4a8 > > changeset: 71811:3a6782f2a4a8 > > user: Brian Curtin > > date: Wed Aug 10 20:32:10 2011 -0500 > > summary: > > News item for #12724 > > > > files: > > Misc/NEWS | 2 ++ > > 1 files changed, 2 insertions(+), 0 deletions(-) > > If it gets a NEWS entry, why isn't it documented? I left it out just to see if you were paying attention :) Now that I got caught, added in http://hg.python.org/cpython/rev/e88362fb4950 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sandro.tosi at gmail.com Thu Aug 11 16:47:17 2011 From: sandro.tosi at gmail.com (Sandro Tosi) Date: Thu, 11 Aug 2011 16:47:17 +0200 Subject: [Python-Dev] [Python-checkins] cpython (2.7): Fix closes Issue12722 - link heapq source in the text format in the In-Reply-To: <4E43E84F.3050309@netwok.org> References: <4E42E24B.8020601@udel.edu> <4E42F0F0.1070203@gmail.com> <4E43E84F.3050309@netwok.org> Message-ID: On Thu, Aug 11, 2011 at 16:33, ?ric Araujo wrote: > Hi, > >> IIRC the reason why we don't do it on 2.x is because we don't have the >> 'source' directive available in Sphinx and therefore we would have to >> update all the links manually to link to h.p.o instead of s.p.o. > > In 3.2 and higher, there is a custom source role in > Doc/tools/sphinxext/pyspecific.py. ?For 2.7, I volunteered to change all > links manually (sed being, as usual, my friend) but just lacked time. Is there a reason we can't use the same sphinx role in 2.7 too? And also the same sphinx (thus sphinxext) versions on 2.7 and 3.x? that would probably help in keeping the diffs on the documentation smaller. Regards, -- Sandro Tosi (aka morph, morpheus, matrixhasu) My website: http://matrixhasu.altervista.org/ Me at Debian: http://wiki.debian.org/SandroTosi From wok at no-log.org Thu Aug 11 17:01:21 2011 From: wok at no-log.org (Merwok) Date: Thu, 11 Aug 2011 17:01:21 +0200 Subject: [Python-Dev] [Python-checkins] cpython (2.7): Fix closes Issue12722 - link heapq source in the text format in the In-Reply-To: References: <4E42E24B.8020601@udel.edu> <4E42F0F0.1070203@gmail.com> <4E43E84F.3050309@netwok.org> Message-ID: <4E43EEC1.9060000@no-log.org> Le 11/08/2011 16:47, Sandro Tosi a ?crit : > Is there a reason we can't use the same sphinx role in 2.7 too? And > also the same sphinx (thus sphinxext) versions on 2.7 and 3.x? that > would probably help in keeping the diffs on the documentation smaller. Even though the pyspecific module is wholly private and used only for our build process, Georg seems to follow the rule that we don?t add new features in stable branches. I think that?s why the new role was added in 3.2 when in was in dev phase but not to 2.7 (see #10334). We also use different versions of Sphinx. Regards From rdmurray at bitdance.com Thu Aug 11 17:00:21 2011 From: rdmurray at bitdance.com (R. 
David Murray) Date: Thu, 11 Aug 2011 11:00:21 -0400 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: <4E43E8D0.40201@netwok.org> References: <4E43E8D0.40201@netwok.org> Message-ID: <20110811150022.02A192505A7@webabinitio.net> I think you missed the point of the PEP. The point is to create a new, python-dev-blessed standard that the distros will follow. The primary goal is so that a script can specify python2 or python3 in the #! line and expect that to work on all compliant linux systems, which we hope will be all of them. Everything else is just details. And yes, that distinction is much more important than the distinction between minor version numbers. That's the whole point of python3, after all. -- R. David Murray http://www.bitdance.com From merwok at netwok.org Thu Aug 11 17:12:22 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Thu, 11 Aug 2011 17:12:22 +0200 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: <20110811150022.02A192505A7@webabinitio.net> References: <4E43E8D0.40201@netwok.org> <20110811150022.02A192505A7@webabinitio.net> Message-ID: <4E43F156.8040008@netwok.org> Hi Devid, > I think you missed the point of the PEP. The point is to create a new, > python-dev-blessed standard that the distros will follow. The primary > goal is so that a script can specify python2 or python3 in the #! > line and expect that to work on all compliant linux systems, which we > hope will be all of them. Everything else is just details. I?m sorry if my opinion on that main point was lost among remarks on details. To rephrase one part of my reply: Right now, the de facto standard is that shebangs can use python to mean python2 and python3 to mean python3. Adding python2 to that and supporting making python ambiguous seems harmful to me. Regards From barry at python.org Thu Aug 11 17:39:52 2011 From: barry at python.org (Barry Warsaw) Date: Thu, 11 Aug 2011 11:39:52 -0400 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning In-Reply-To: <4E43E9A6.7020608@netwok.org> References: <4E43E9A6.7020608@netwok.org> Message-ID: <20110811113952.2e257351@resist.wooz.org> On Aug 11, 2011, at 04:39 PM, ?ric Araujo wrote: >> * XXX what is the __file__ of a "pure virtual" package? ``None``? >> Some arbitrary string? The path of the first directory with a >> trailing separator? No matter what we put, *some* code is >> going to break, but the last choice might allow some code to >> accidentally work. Is that good or bad? >A pure virtual package having no source file, I think it should have no >__file__ at all. I don?t know if that would break more code than using >an empty string for example, but it feels righter. I agree that the empty string is the worst of the choices. no __file__ or __file__=None is better. -Barry From glyph at twistedmatrix.com Thu Aug 11 20:02:59 2011 From: glyph at twistedmatrix.com (Glyph Lefkowitz) Date: Thu, 11 Aug 2011 14:02:59 -0400 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning In-Reply-To: <20110811113952.2e257351@resist.wooz.org> References: <4E43E9A6.7020608@netwok.org> <20110811113952.2e257351@resist.wooz.org> Message-ID: <41282FF3-4DAF-4996-B745-E0BEA477FB01@twistedmatrix.com> On Aug 11, 2011, at 11:39 AM, Barry Warsaw wrote: > On Aug 11, 2011, at 04:39 PM, ?ric Araujo wrote: > >>> * XXX what is the __file__ of a "pure virtual" package? ``None``? >>> Some arbitrary string? 
The path of the first directory with a >>> trailing separator? No matter what we put, *some* code is >>> going to break, but the last choice might allow some code to >>> accidentally work. Is that good or bad? >> A pure virtual package having no source file, I think it should have no >> __file__ at all. I don?t know if that would break more code than using >> an empty string for example, but it feels righter. > > I agree that the empty string is the worst of the choices. no __file__ or > __file__=None is better. In some sense, I agree: hacks like empty strings are likely to lead to path-manipulation bugs where the wrong file gets opened (or worse, deleted, with predictable deleterious effects). But the whole "pure virtual" mechanism here seems to pile even more inconsistency on top of an already irritatingly inconsistent import mechanism. I was reasonably happy with my attempt to paper over PEP 302's weirdnesses from a user perspective: http://twistedmatrix.com/documents/11.0.0/api/twisted.python.modules.html (or https://launchpad.net/modules if you are not a Twisted user) Users of this API can traverse the module hierarchy with certain expectations; each module or package would have .pathEntry and .filePath attributes, each of which would refer to the appropriate place. Of course __path__ complicates things a bit, but so it goes. Now it seems like pure virtual packages are going to introduce a new type of special case into the hierarchy which have neither .pathEntry nor .filePath objects. Rather than a one-by-one ad-hoc consideration of which attribute should be set to None or empty strings or "" or what have you, I'd really like to see a discussion in the PEP saying what a package really is vs. what a module is, and what one can reasonably expect from it from an API and tooling perspective. Right now I have to puzzle out the intent of the final API from the problem/solution description and thought experiment. Despite authoring several namespace packages myself, I don't have any of the problems described in the PEP. I just want to know how to write correct tools given this new specification. I suspect that this PEP will be the only reference for how packages work for a long time coming (just as PEP 302 was before it) so it should really get this right. -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Thu Aug 11 20:12:41 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 11 Aug 2011 20:12:41 +0200 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning References: <4E43E9A6.7020608@netwok.org> <20110811113952.2e257351@resist.wooz.org> Message-ID: <20110811201241.32c4348c@pitrou.net> On Thu, 11 Aug 2011 11:39:52 -0400 Barry Warsaw wrote: > On Aug 11, 2011, at 04:39 PM, ?ric Araujo wrote: > > >> * XXX what is the __file__ of a "pure virtual" package? ``None``? > >> Some arbitrary string? The path of the first directory with a > >> trailing separator? No matter what we put, *some* code is > >> going to break, but the last choice might allow some code to > >> accidentally work. Is that good or bad? > >A pure virtual package having no source file, I think it should have no > >__file__ at all. I don?t know if that would break more code than using > >an empty string for example, but it feels righter. > > I agree that the empty string is the worst of the choices. no __file__ or > __file__=None is better. None should be the answer. 
It simplifies inspection of module data (repr(__file__) gives you something recognizable instead of raising) and makes semantically sense (!) since there is, indeed, no actual file backing the module. Regards Antoine. From g.brandl at gmx.net Thu Aug 11 20:22:35 2011 From: g.brandl at gmx.net (Georg Brandl) Date: Thu, 11 Aug 2011 20:22:35 +0200 Subject: [Python-Dev] [Python-checkins] cpython (2.7): Fix closes Issue12722 - link heapq source in the text format in the In-Reply-To: <4E43EEC1.9060000@no-log.org> References: <4E42E24B.8020601@udel.edu> <4E42F0F0.1070203@gmail.com> <4E43E84F.3050309@netwok.org> <4E43EEC1.9060000@no-log.org> Message-ID: Am 11.08.2011 17:01, schrieb Merwok: > Le 11/08/2011 16:47, Sandro Tosi a ?crit : >> Is there a reason we can't use the same sphinx role in 2.7 too? And >> also the same sphinx (thus sphinxext) versions on 2.7 and 3.x? that >> would probably help in keeping the diffs on the documentation smaller. > > Even though the pyspecific module is wholly private and used only for > our build process, Georg seems to follow the rule that we don?t add new > features in stable branches. I think that?s why the new role was added > in 3.2 when in was in dev phase but not to 2.7 (see #10334). I think I just put it in default as a test, and intended to backport it later when it proved useful. You're welcome to do so now. > We also use different versions of Sphinx. That doesn't matter for this role. Georg From pje at telecommunity.com Thu Aug 11 20:30:51 2011 From: pje at telecommunity.com (P.J. Eby) Date: Thu, 11 Aug 2011 14:30:51 -0400 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning In-Reply-To: <4E43E9A6.7020608@netwok.org> References: <4E43E9A6.7020608@netwok.org> Message-ID: <20110811183114.701DF3A406B@sparrow.telecommunity.com> At 04:39 PM 8/11/2011 +0200, ??ric Araujo wrote: >Hi, > >I've read PEP 402 and would like to offer comments. Thanks. >Minor: I would reserve "packaging" for >packaging/distribution/installation/deployment matters, not Python >modules. I suggest "Python package semantics". Changing to "Python package import semantics" to hopefully be even clearer. ;-) (Nitpick: I was somewhat intentionally ambiguous because we are talking here about how a package is physically implemented in the filesystem, and that actually *is* kind of a packaging issue. But it's not necessarily a *useful* intentional ambiguity, so I've no problem with removing it.) >Minor: In the UNIX world, or with version control tools, moving and >renaming are the same one thing (hg mv spam.py spam/__init__.py for >example). Also, if you turn a module into a package, you may want to >move code around, change imports, etc., so I'm not sure the renaming >part is such a big step. Anyway, if the import-sig people say that >users think it's a complex or costly operation, I can believe it. It's not that it's complex or costly in anything other than *mental* overhead -- you have to remember to do it and it's not particularly obvious. (But people on import-sig did mention this and other things covered by the PEP as being a frequent root cause of beginner inquiries on #python, Stackoverflow, et al.) > > (By the way, both of these additions to the import protocol (i.e. the > > dynamically-added ``__path__``, and dynamically-created modules) > > apply recursively to child packages, using the parent package's > > ``__path__`` in place of ``sys.path`` as a basis for generating a > > child ``__path__``. 
This means that self-contained and virtual > > packages can contain each other without limitation, with the caveat > > that if you put a virtual package inside a self-contained one, it's > > gonna have a really short ``__path__``!) >I don't understand the caveat or its implications. Since each package's __path__ is the same length or shorter than its parent's by default, then if you put a virtual package inside a self-contained one, it will be functionally speaking no different than a self-contained one, in that it will have only one path entry. So, it's not really useful to put a virtual package inside a self-contained one, even though you can do it. (Apart form it letting you avoid a superfluous __init__ module, assuming it's indeed superfluous.) > > In other words, we don't allow pure virtual packages to be imported > > directly, only modules and self-contained packages. (This is an > > acceptable limitation, because there is no *functional* value to > > importing such a package by itself. After all, the module object > > will have no *contents* until you import at least one of its > > subpackages or submodules!) > > > > Once ``zc.buildout`` has been successfully imported, though, there > > *will* be a ``zc`` module in ``sys.modules``, and trying to import it > > will of course succeed. We are only preventing an *initial* import > > from succeeding, in order to prevent false-positive import successes > > when clashing subdirectories are present on ``sys.path``. >I find that limitation acceptable. After all, there is no zc project, >and no zc module, just a zc namespace. I'll just regret that it's not >possible to provide a module docstring to inform that this is a >namespace package used for X and Y. It *is* possible - you'd just have to put it in a "zc.py" file. IOW, this PEP still allows "namespace-defining packages" to exist, as was requested by early commenters on PEP 382. It just doesn't *require* them to exist in order for the namespace contents to be importable. > > The resulting list (whether empty or not) is then stored in a > > ``sys.virtual_package_paths`` dictionary, keyed by module name. >This was probably said on import-sig, but here I go: yet another import >artifact in the sys module! I hope we get ImportEngine in 3.3 to clean >up all this. Well, I rather *like* having them there, personally, vs. having to learn yet another API, but oh well, whatever. AFAIK, ImportEngine isn't going to do away with the need for the global ones to live somewhere, at least not in 3.3. > > * A new ``extend_virtual_paths(path_entry)`` function, to extend > > existing, already-imported virtual packages' ``__path__`` attributes > > to include any portions found in a new ``sys.path`` entry. This > > function should be called by applications extending ``sys.path`` > > at runtime, e.g. when adding a plugin directory or an egg to the > > path. >Let's imagine my application Spam has a namespace spam.ext for plugins. > To use a custom directory where plugins are stored, or a zip file with >plugins (I don't use eggs, so let me talk about zip files here), I'd >have to call sys.path.append *and* pkgutil.extend_virtual_paths? As written in the current proposal, yes. 
There was some discussion on Python-Dev about having this happen automatically, and I proposed that it could be done by making virtual packages' __path__ attributes an iterable proxy object, rather than a list: http://mail.python.org/pipermail/python-dev/2011-July/112429.html (This is an open option that hasn't been added to the PEP as yet, because I wanted to know Guido's thoughts on the proposal as it stands before burdening it with more implementation detail for a feature (automatic updates) that he might not be very keen on to begin with, even it does make the semantics that much more familiar for Perl or PHP users.) > > * ``ImpImporter.iter_modules()`` should be changed to also detect and > > yield the names of modules found in virtual packages. >Is there any value in providing an argument to get the pre-PEP behavior? > Or to look at it from a different place, how can Python code know that >some module is a virtual or pure virtual package, if that is even a >useful thing to know? Is it a useful thing? Dunno. That's why it's open for comment. If the auto-update approach is used, then the __path__ of virtual packages will have a distinguishable type(). > > Last, but not least, the ``imp`` module (or ``importlib``, if > > appropriate) should expose the algorithm described in the `virtual > > paths`_ section above, as a > > ``get_virtual_path(modulename, parent_path=None)`` function, so that > > creators of ``__import__`` replacements can use it. >If I'm not mistaken, the rule of thumb these days is that imp is edited >when it's absolutely necessary, otherwise code goes into importlib (more >easily written, read and maintained). > >I wonder if importlib.import_module could implement the new import >semantics all by itself, so that we can benefit from this PEP in older >Pythons (importlib is on PyPI). AFAIK, *that* importlib doesn't include a reimplementation of the full import process, though I suppose I could be wrong. My personal plan was just to create a specific pep382 module to include with future versions of setuptools, but as things worked out, I'm not sure if that'll be sanely doable for pep402. > > * If you are changing a currently self-contained package into a > > virtual one, it's important to note that you can no longer use its > > ``__file__`` attribute to locate data files stored in a package > > directory. Instead, you must search ``__path__`` or use the > > ``__file__`` of a submodule adjacent to the desired files, or > > of a self-contained subpackage that contains the desired files. >Wouldn't pkgutil.get_data help here? Not so long as you passed it a package name instead of a module name. This issue exists today with namespace pacakges; it's not new to virtual packages. >Besides, putting data files in a Python package is held very poorly by >some (mostly people following the File Hierarchy Standard), ISTM that anybody who thinks that is being inconsistent in considering the Python code itself to not be a "data file" by that same criterion... especially since one of the more common uses for such "data" files are for e.g. HTML templates (which usually contain some sort of code) or GUI resources (which are pretty tightly bound to the code). Are those same people similarly concerned when a Firefox extension contains image files as well as JavaScript? And if not, why is Python different? 
IOW, I think that those people are being confused by our use of the term "data" and thus think of it as an entirely different sort of "data" than what is meant by "package data" in the Python world. I am not sure what word would unconfuse (defuse?) them, but we simply mean "files that are part of the package but are not of a type that Python can import by default," not "user-modifiable data" or "data that has meaning or usefulness to code other than the code it was packaged with." Perhaps "package-embedded resources" would be a better phrase? Certainly, it implies that they're *supposed* to be embedded there. ;-) > > * XXX what is the __file__ of a "pure virtual" package? ``None``? > > Some arbitrary string? The path of the first directory with a > > trailing separator? No matter what we put, *some* code is > > going to break, but the last choice might allow some code to > > accidentally work. Is that good or bad? >A pure virtual package having no source file, I think it should have no >__file__ at all. I don't know if that would break more code than using >an empty string for example, but it feels righter. > > > For those implementing PEP \302 importer objects: >Minor: Here I think a link would not be a nuisance (IOW remove the >backslash). Done. From rdmurray at bitdance.com Thu Aug 11 20:31:33 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Thu, 11 Aug 2011 14:31:33 -0400 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: <4E43F156.8040008@netwok.org> References: <4E43E8D0.40201@netwok.org> <20110811150022.02A192505A7@webabinitio.net> <4E43F156.8040008@netwok.org> Message-ID: <20110811183133.D56E72505A7@webabinitio.net> On Thu, 11 Aug 2011 17:12:22 +0200, =?UTF-8?B?w4lyaWMgQXJhdWpv?= wrote: > I???m sorry if my opinion on that main point was lost among remarks on > details. To rephrase one part of my reply: Right now, the de facto > standard is that shebangs can use python to mean python2 and python3 to > mean python3. Adding python2 to that and supporting making python > ambiguous seems harmful to me. OK. So you are -1 on the PEP. I'm a big +1. To address your argument briefly, *now* a minority of distros have python pointing to python2. We expect this to change. It may not happen for 5 years, but someday it will. So this PEP is about preparing for the future. Given that, I fail to see what harm having an additional symlink named python2 will do. And yes this was argued about earlier and should (in theory at least) be addressed by the PEP, which is why I'm concluding that you are -1 on the PEP :). -- R. David Murray http://www.bitdance.com From sturla at molden.no Thu Aug 11 21:11:11 2011 From: sturla at molden.no (Sturla Molden) Date: Thu, 11 Aug 2011 21:11:11 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: References: Message-ID: <4E44294F.5060005@molden.no> Den 09.08.2011 11:33, skrev ???? ?????????: > Probably I want to re-invent a bicycle. I want developers to say me > why we can not remove GIL in that way: > > 1. Remove GIL completely with all current logick. > 2. Add it's own RW-locking to all mutable objects (like list or dict) > 3. Add RW-locks to every context instance > 4. use RW-locks when accessing members of object instances > > Only one reason, I see, not do that -- is performance of > singlethreaded applications. Why not to fix locking functions for this > 4 cases to stubs when only one thread present? This has been discussed to death before, and is probably OT to this list. 
There is another reason besides the speed of single-threaded applications, but it is rather technical: As CPython uses reference counting for garbage collection, we would get "false sharing" of reference counts -- which would work as an "invisible GIL" (synchronization bottleneck) anyway. That is, if one processor writes to memory in a cache-line shared by another processor, they must stop whatever they are doing to synchronize the dirty cache lines with RAM. Thus, updating reference counts would flood the memory bus with traffic and be much worse than the GIL. Instead of doing useful work, the processors would be stuck synchronizing dirty cache lines. You can think of it as a severe traffic jam. To get rid of the GIL, CPython would either need (a) another GC method (e.g. similar to .NET or Java) or (b) another threading model (e.g. one interpreter per thread, as in Tcl, Erlang, or .NET app domains). As CPython has neither, we are better off with the GIL. Nobody likes the GIL; fork a project to write a GIL-free CPython if you can. But note that: 1. With Cython, you have full manual control over the GIL. IronPython and Jython do not have a GIL at all. 2. Much of the FUD against the GIL is plain ignorance: The GIL slows down parallel computational code, but any serious number crunching should use numerical performance libraries (i.e. C extensions) anyway. Libraries are free to release the GIL or spawn threads internally. Also, the GIL does not matter for (a) I/O bound code such as network servers or clients and (b) background threads in GUI programs -- which are the two common use-cases for threads in Python programs. If the GIL bites you, it's most likely a warning that your program is badly written, independent of the GIL issue. There seems to be a common misunderstanding that Python threads work like fibers due to the GIL. They do not! Python threads are native OS threads and can do anything a thread can do, including executing library code in parallel. If one thread is blocking on I/O, the other threads can continue with their business. The only thing Python threads cannot do is access the Python interpreter concurrently. And the reason CPython needs that restriction is reference counting. Sturla From victor.stinner at haypocalc.com Thu Aug 11 21:31:56 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Thu, 11 Aug 2011 21:31:56 +0200 Subject: [Python-Dev] Status of the PEP 400? (deprecate codecs.StreamReader/StreamWriter) In-Reply-To: References: <4E308D63.9090901@haypocalc.com> <4E3125D7.2030103@egenix.com> <4E312BCB.3080301@haypocalc.com> <20110729171731.2059cc3e@pitrou.net> Message-ID: <4E442E2C.4050700@haypocalc.com> Le 29/07/2011 19:01, Guido van Rossum a écrit : >>>> I will add your alternative to the PEP (except if you would like to do >>>> that yourself?). If I understood correctly, you propose to: >>>> >>>> * rename codecs.open() to codecs.open_stream() >>>> * change codecs.open() to reuse open() (and so io.TextIOWrapper) (...) > > +1 Ok, most people prefer this option. Should I modify the PEP to "move" this option has the first/main proposition (move my proposition as an alternative?), or can the PEP be validated in the current state?
Victor From tjreedy at udel.edu Fri Aug 12 00:05:04 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Thu, 11 Aug 2011 18:05:04 -0400 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: <4E43E8D0.40201@netwok.org> References: <4E43E8D0.40201@netwok.org> Message-ID: On 8/11/2011 10:36 AM, ?ric Araujo wrote: > It would be interesting to have feedback from people who lived the > transition to Python 2. There was no comparable transition. Python 2.0 was basically 1.6 renamed for a different distributor. I regard Python 2.2, which introduced new-style, as the beginning of Python 2 as something significantly different from Python 1. I suppose one could also point to the earlier intro of unicode. The new iterator protocol was also a major change. In any case, back compatibility was kept in all three respects (and others) until Python 3. -- Terry Jan Reedy From tjreedy at udel.edu Fri Aug 12 00:21:20 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Thu, 11 Aug 2011 18:21:20 -0400 Subject: [Python-Dev] Status of the PEP 400? (deprecate codecs.StreamReader/StreamWriter) In-Reply-To: <4E442E2C.4050700@haypocalc.com> References: <4E308D63.9090901@haypocalc.com> <4E3125D7.2030103@egenix.com> <4E312BCB.3080301@haypocalc.com> <20110729171731.2059cc3e@pitrou.net> <4E442E2C.4050700@haypocalc.com> Message-ID: On 8/11/2011 3:31 PM, Victor Stinner wrote: > Le 29/07/2011 19:01, Guido van Rossum a ?crit : >>>>> I will add your alternative to the PEP (except if you would like to do >>>>> that yourself?). If I understood correctly, you propose to: >>>>> >>>>> * rename codecs.open() to codecs.open_stream() >>>>> * change codecs.open() to reuse open() (and so io.TextIOWrapper) > (...) >> >> +1 > > Ok, most people prefer this option. Should I modify the PEP to "move" > this option has the first/main proposition (move my proposition as an > alternative?), or can the PEP be validated in the current state? I would relabel the above as the Minimal Change Alternative or M.A.L. alternative or whatever and possibly move it but in any case note that Guido (and others) accepted that alternative with consideration of more drastic changes deferred to later. And add an explicit reference to the email you quoted. -- Terry Jan Reedy From ncoghlan at gmail.com Fri Aug 12 01:36:32 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 12 Aug 2011 09:36:32 +1000 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning In-Reply-To: <20110811183114.701DF3A406B@sparrow.telecommunity.com> References: <4E43E9A6.7020608@netwok.org> <20110811183114.701DF3A406B@sparrow.telecommunity.com> Message-ID: On Fri, Aug 12, 2011 at 4:30 AM, P.J. Eby wrote: > At 04:39 PM 8/11/2011 +0200, ??ric Araujo wrote: >> > The resulting list (whether empty or not) is then stored in a >> > ``sys.virtual_package_paths`` dictionary, keyed by module name. >> This was probably said on import-sig, but here I go: yet another import >> artifact in the sys module! ?I hope we get ImportEngine in 3.3 to clean >> up all this. > > Well, I rather *like* having them there, personally, vs. having to learn yet > another API, but oh well, whatever. ?AFAIK, ImportEngine isn't going to do > away with the need for the global ones to live somewhere, at least not in > 3.3. And likely not for the entire 3.x series - I shudder at the thought of the backwards incompatibility hell associated with trying to remove them... 
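For readers keeping score, the existing interdependent globals being referred to are, in a 3.2-era interpreter, roughly:

    import sys
    sys.modules              # every module imported so far
    sys.path                 # search path for top-level imports
    sys.meta_path            # finders consulted before sys.path is scanned
    sys.path_hooks           # factories that turn path entries into importers
    sys.path_importer_cache  # caches the importer chosen for each path entry

sys.virtual_package_paths would simply join that list.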
The point of the ImportEngine API is that the caching elements of the import state introduce cross dependencies between various global data structures. Code that manipulates those data structures needs to correctly invalidate or otherwise update the state as things change. I seem to recall a certain programming construct that is designed to make it easier to manage interdependent data structures... Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From ncoghlan at gmail.com Fri Aug 12 05:10:24 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 12 Aug 2011 13:10:24 +1000 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: <4E43F156.8040008@netwok.org> References: <4E43E8D0.40201@netwok.org> <20110811150022.02A192505A7@webabinitio.net> <4E43F156.8040008@netwok.org> Message-ID: On Fri, Aug 12, 2011 at 1:12 AM, ?ric Araujo wrote: > I?m sorry if my opinion on that main point was lost among remarks on > details. ?To rephrase one part of my reply: Right now, the de facto > standard is that shebangs can use python to mean python2 and python3 to > mean python3. ?Adding python2 to that and supporting making python > ambiguous seems harmful to me. This PEP comes mainly out of the fact that we collectively think Arch (the case that prompted the original discussion) are making a mistake that will hurt their users in switching the default Python *right now*, so the PEP is first and foremost designed to record that consensus. However, their actions do mean that the 'python' name is *already* ambiguous, no matter what the mainstream distros think. The Debian maintainers may not care about that, but *I* do, as does anyone wanting to write distro-agnostic shebang lines. Given that some distros (large or small), along with some system administrators, are going to want to have python refer to python3, either now or at some point in the future, there are really only two options available to us here: 1. Accept the reality of that situation, and propose a mechanism that minimises the impact of the resulting ambiguity on end users of Python by allowing developers to be explicit about their target language. This is the approach advocated in PEP 394. 2. Tell the Arch developers (and anyone else inclined to point the python name at python3) that they're wrong, and the python symlink should, now and forever, always refer to a version of Python 2.x. It's worth noting that there has never been any previous python-dev consensus to use 'python3' indefinitely - the status quo came about because it makes sense for the moment even to those of us that *want* 'python' to eventually refer to 'python3', so there was no previous need for an explicit choice between the two alternatives. By migrating, Arch has forced us to choose between either supporting their action or else telling Python users "Don't blame us, blame the distros that pointed python at python3" when things break. I flat out disagree with the second approach - having to type 'python3' when a 3.x variant is the only version of Python installed would just be dumb, and I also think playing that kind of blame game in the face of inevitable cross-distro compatibility problems is disrespectful to our users. If you want to get Zen about it, 'practicality beats purity', 'explicit is better than implicit' and 'In the face of ambiguity, refuse the temptation to guess' all come down in favour of the approach in PEP 394. 
If I haven't persuaded you to adjust your view up to at least a -0 (i.e. don't entirely agree, but won't object to others moving forward with it) and you still wish to advocate for the second approach, then I suggest creating a competing PEP in order to provide a clear alternative proposal (with Guido or his appointed delegate having the final say, as usual) that explains the alternative recommendation for: - distros that have already switched their 'python' links to refer to python3 - Python developers wishing to write shebang lines that work on multiple 2.x versions and support both platforms that define 'python2' and those that only define 'python' FWIW, the closest historical precedent I can recall is Red Hat's issues when users switched the system Python from 1.5 to 2.2+, and the lesson learned from that exercise was that distro installed utilities should always reference a specific Python version rather than relying on the system administrator leaving the 'python' link alone. It sounds like Debian chose not to heed that lesson, which is unfortunate (although, to be honest, I'm not sure how well Fedora/Red Hat heed it, either). However, the commentary in PEP 394 based on that history (i.e. that distros really shouldn't care where the python name points) will remain in place. Regards, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From ncoghlan at gmail.com Fri Aug 12 05:14:14 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 12 Aug 2011 13:14:14 +1000 Subject: [Python-Dev] Status of the PEP 400? (deprecate codecs.StreamReader/StreamWriter) In-Reply-To: References: <4E308D63.9090901@haypocalc.com> <4E3125D7.2030103@egenix.com> <4E312BCB.3080301@haypocalc.com> <20110729171731.2059cc3e@pitrou.net> <4E442E2C.4050700@haypocalc.com> Message-ID: On Fri, Aug 12, 2011 at 8:21 AM, Terry Reedy wrote: > On 8/11/2011 3:31 PM, Victor Stinner wrote: >> Ok, most people prefer this option. Should I modify the PEP to "move" >> this option has the first/main proposition (move my proposition as an >> alternative?), or can the PEP be validated in the current state? > > I would relabel the above as the Minimal Change Alternative or M.A.L. > alternative or whatever and possibly move it but in any case note that Guido > (and others) accepted that alternative with consideration of more drastic > changes deferred to later. And add an explicit reference to the email you > quoted. Yeah, definitely retitle/rewrite/rearrange to be clear what Guido accepted and then state that any future deprecation of components in the codecs module will be dealt with as a new PEP. Regards, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From vinay_sajip at yahoo.co.uk Fri Aug 12 10:47:30 2011 From: vinay_sajip at yahoo.co.uk (Vinay Sajip) Date: Fri, 12 Aug 2011 08:47:30 +0000 (UTC) Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning References: <4E43E9A6.7020608@netwok.org> Message-ID: ?ric Araujo netwok.org> writes: > Besides, putting data files in a Python package is held very poorly by > some (mostly people following the File Hierarchy Standard), and in > distutils2/packaging, we (will) have a resources system that?s as > convenient for users and more flexible for OS packagers. Using __file__ > for more than information on the module is frowned upon for other > reasons anyway (I talked about a Debian developer about this one day but > forgot), so I think the limitation is okay. 
> The FHS does not apply in all scenarios - not all Python code is deployed/packaged at system level. For example, plug-ins (such as Django apps) are often not meant to be installed by a system-level packager. This might also be true in scenarios where Python is embedded into some other application. It's really useful to be able to co-locate packages with their data (e.g. in a zip file) and I don't think all instances of putting data files in a package are to be frowned upon. Regards, Vinay Sajip From solipsis at pitrou.net Fri Aug 12 12:58:46 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 12 Aug 2011 12:58:46 +0200 Subject: [Python-Dev] PEP 3154 - pickle protocol 4 Message-ID: <20110812125846.00a75cd1@pitrou.net> Hello, This PEP is an attempt to foster a number of small incremental improvements in a future pickle protocol version. The PEP process is used in order to gather as many improvements as possible, because the introduction of a new protocol version should be a rare occurrence. Feel free to suggest any additions. Regards Antoine. http://www.python.org/dev/peps/pep-3154/ PEP: 3154 Title: Pickle protocol version 4 Version: $Revision$ Last-Modified: $Date$ Author: Antoine Pitrou Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 2011-08-11 Python-Version: 3.3 Post-History: Resolution: TBD Abstract ======== Data serialized using the pickle module must be portable accross Python versions. It should also support the latest language features as well as implementation-specific features. For this reason, the pickle module knows about several protocols (currently numbered from 0 to 3), each of which appeared in a different Python version. Using a low-numbered protocol version allows to exchange data with old Python versions, while using a high-numbered protocol allows access to newer features and sometimes more efficient resource use (both CPU time required for (de)serializing, and disk size / network bandwidth required for data transfer). Rationale ========= The latest current protocol, coincidentally named protocol 3, appeared with Python 3.0 and supports the new incompatible features in the language (mainly, unicode strings by default and the new bytes object). The opportunity was not taken at the time to improve the protocol in other ways. This PEP is an attempt to foster a number of small incremental improvements in a future new protocol version. The PEP process is used in order to gather as many improvements as possible, because the introduction of a new protocol version should be a rare occurrence. Improvements in discussion ========================== 64-bit compatibility for large objects -------------------------------------- Current protocol versions export object sizes for various built-in types (str, bytes) as 32-bit ints. This forbids serialization of large data [1]_. New opcodes are required to support very large bytes and str objects. Native opcodes for sets and frozensets -------------------------------------- Many common built-in types (such as str, bytes, dict, list, tuple) have dedicated opcodes to improve resource consumption when serializing and deserializing them; however, sets and frozensets don't. Adding such opcodes would be an obvious improvement. Also, dedicated set support could help remove the current impossibility of pickling self-referential sets [2]_. 
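[Editorial illustration, not part of the PEP draft: the kind of structure issue #9269 is about (a set reachable from one of its own members) currently cannot be dumped at all, e.g.:

    import pickle

    class Ref(object):
        pass

    s = set()
    r = Ref()
    r.owner = s        # the member points back at the set...
    s.add(r)           # ...and the set contains the member
    pickle.dumps(s)    # dies with a recursion error on protocols 0-3, since
                       # the set's items are pickled before the set itself is
                       # memoized

A dedicated set opcode could memoize the empty set first and add the items afterwards, the same trick lists and dicts already use.]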
Binary encoding for all opcodes ------------------------------- The GLOBAL opcode, which is still used in protocol 3, uses the so-called "text" mode of the pickle protocol, which involves looking for newlines in the pickle stream. Looking for newlines is difficult to optimize on a non-seekable stream, and therefore a new version of GLOBAL (BINGLOBAL?) could use a binary encoding instead. It seems that all other opcodes emitted when using protocol 3 already use binary encoding. Acknowledgments =============== (...) References ========== .. [1] "pickle not 64-bit ready": http://bugs.python.org/issue11564 .. [2] "Cannot pickle self-referencing sets": http://bugs.python.org/issue9269 Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: From catch-all at masklinn.net Fri Aug 12 14:32:43 2011 From: catch-all at masklinn.net (Xavier Morel) Date: Fri, 12 Aug 2011 14:32:43 +0200 Subject: [Python-Dev] PEP 3154 - pickle protocol 4 In-Reply-To: <20110812125846.00a75cd1@pitrou.net> References: <20110812125846.00a75cd1@pitrou.net> Message-ID: On 2011-08-12, at 12:58 , Antoine Pitrou wrote: > Current protocol versions export object sizes for various built-in types > (str, bytes) as 32-bit ints. This forbids serialization of large data > [1]_. New opcodes are required to support very large bytes and str > objects. How about changing object sizes to be 64b always? Too much overhead for the common case (which might be smaller pickled objects)? Or a slightly more devious scheme (e.g. tag-bit, untagged is 31b size, tagged is 63), which would not require adding opcodes for that? > Also, dedicated set support > could help remove the current impossibility of pickling > self-referential sets [2]_. Is there really no possibility of fix recursive pickling once and for all? Dedicated optcodes for resource consumption purposes (and to match those of other build-in types) is still a good idea, but being able to pickle arbitrary recursive structures would be even better would it not? And if specific (new) opcodes are required to handle recursive pickling correctly, that's the occasion. From solipsis at pitrou.net Fri Aug 12 15:30:09 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 12 Aug 2011 15:30:09 +0200 Subject: [Python-Dev] PEP 3154 - pickle protocol 4 In-Reply-To: References: <20110812125846.00a75cd1@pitrou.net> Message-ID: <1313155809.3603.18.camel@localhost.localdomain> Hello, Le vendredi 12 ao?t 2011 ? 14:32 +0200, Xavier Morel a ?crit : > On 2011-08-12, at 12:58 , Antoine Pitrou wrote: > > Current protocol versions export object sizes for various built-in types > > (str, bytes) as 32-bit ints. This forbids serialization of large data > > [1]_. New opcodes are required to support very large bytes and str > > objects. > How about changing object sizes to be 64b always? Too much overhead for the > common case (which might be smaller pickled objects)? Yes, and also the old opcodes must still be supported, so there's no maintenance gain in not exploiting them. > Or a slightly more > devious scheme (e.g. tag-bit, untagged is 31b size, tagged is 63), which > would not require adding opcodes for that? The opcode space is not full enough to justify this kind of complication, IMO. > > Also, dedicated set support > > could help remove the current impossibility of pickling > > self-referential sets [2]_. 
> > Is there really no possibility of fix recursive pickling once > and for all? Dedicated optcodes for resource consumption > purposes (and to match those of other build-in types) is > still a good idea, but being able to pickle arbitrary > recursive structures would be even better would it not? That's true. Actually, it seems pickling recursive sets could have worked from the start, if a difference __reduce__ had been chosen and a __setstate__ had been defined: >>> class X: pass ... >>> class myset(set): ... def __reduce__(self): ... return (self.__class__, (), list(self)) ... def __setstate__(self, state): ... self.update(state) >>> m = myset((1,2,3)) >>> x = X() >>> x.m = m >>> m.add(x) >>> mm = pickle.loads(pickle.dumps(m)) >>> m myset({1, 2, 3, <__main__.X object at 0x7fe3635c6990>}) >>> mm myset({1, 2, 3, <__main__.X object at 0x7fe3635c6c30>}) # m has a reference loop >>> [x for x in m if getattr(x, 'm', None) is m] [<__main__.X object at 0x7fe3635c6990>] # mm retains a similar reference loop >>> [x for x in mm if getattr(x, 'm', None) is mm] [<__main__.X object at 0x7fe3635c6c30>] # the representation is roughly as efficient as the original one >>> len(pickle.dumps(set([1,2,3]))) 36 >>> len(pickle.dumps(myset([1,2,3]))) 37 We can't change set.__reduce__ (or __reduce_ex__) without a protocol bump, though, since past Pythons would fail loading the pickles. Regards Antoine. From van.lindberg at gmail.com Fri Aug 12 16:32:23 2011 From: van.lindberg at gmail.com (VanL) Date: Fri, 12 Aug 2011 09:32:23 -0500 Subject: [Python-Dev] GIL removal question In-Reply-To: <4E44294F.5060005@molden.no> References: <4E44294F.5060005@molden.no> Message-ID: On 8/11/2011 2:11 PM, Sturla Molden wrote: > > (b) another threading model (e.g. one interpreter per thread, as in Tcl, > Erlang, or .NET app domains). We are close to this, in that we already have baked-in support for subinterpreters. Out of curiosity, why isn't this being pursued? From pje at telecommunity.com Fri Aug 12 17:24:57 2011 From: pje at telecommunity.com (P.J. Eby) Date: Fri, 12 Aug 2011 11:24:57 -0400 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning In-Reply-To: <41282FF3-4DAF-4996-B745-E0BEA477FB01@twistedmatrix.com> References: <4E43E9A6.7020608@netwok.org> <20110811113952.2e257351@resist.wooz.org> <41282FF3-4DAF-4996-B745-E0BEA477FB01@twistedmatrix.com> Message-ID: <20110812152512.112A53A406B@sparrow.telecommunity.com> At 02:02 PM 8/11/2011 -0400, Glyph Lefkowitz wrote: >Rather than a one-by-one ad-hoc consideration of which attribute >should be set to None or empty strings or "" or what have >you, I'd really like to see a discussion in the PEP saying what a >package really is vs. what a module is, and what one can reasonably >expect from it from an API and tooling perspective. The assumption I've been working from is the only guarantee I've ever seen the Python docs give: i.e., that a package is a module object with a __path__ attribute. Modules aren't even required to have a __file__ object -- builtin modules don't, for example. (And the contents of __file__ are not required to have any particular semantics: PEP 302 notes that it can be a dummy value like "", for example.) Technically, btw, PEP 302 requires __file__ to be a string, so making __file__ = None will be a backwards-incompatible change. But any code that walks modules in sys.modules is going to break today if it expects a __file__ attribute to exist, because 'sys' itself doesn't have one! 
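To put that in code: the only safe pattern for tool code that walks sys.modules today is roughly this (a sketch, not taken from any particular project):

    import sys

    for name, mod in sorted(sys.modules.items()):
        if mod is None:
            continue                             # dummy blocking entries
        is_pkg = hasattr(mod, '__path__')        # the one documented marker
                                                 # of package-ness
        origin = getattr(mod, '__file__', None)  # missing for sys, builtins,
                                                 # some namespace packages
        print(name, 'package' if is_pkg else 'module', origin)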
So, my leaning is towards leaving off __file__, since today's code already has to deal with it being nonexistent, if it's working with arbitrary modules, and that'll produce breakage sooner rather than later -- the twisted.python.modules code, for example, would fail with a loud AttributeError, rather than going on to silently assume that a module with a dummy __file__ isn't a package. (Which is NOT a valid assumption *now*, btw, as I'll explain below.) Anyway, if you have any suggestions for verbiage that should be added to the PEP to clarify these assumptions, I'd be happy to add them. However, I think that the real problem you're encountering at the moment has more to do with making assumptions about the Python import ecosystem that aren't valid today, and haven't been valid since at least the introduction of PEP 302, if not earlier import hook systems as well. > But the whole "pure virtual" mechanism here seems to pile even > more inconsistency on top of an already irritatingly inconsistent > import mechanism. I was reasonably happy with my attempt to paper > over PEP 302's weirdnesses from a user perspective: > >http://twistedmatrix.com/documents/11.0.0/api/twisted.python.modules.html > >(or https://launchpad.net/modules if >you are not a Twisted user) > >Users of this API can traverse the module hierarchy with certain >expectations; each module or package would have .pathEntry and >.filePath attributes, each of which would refer to the appropriate >place. Of course __path__ complicates things a bit, but so it goes. I don't mean to be critical, and no doubt what you've written works fine for your current requirements, but on my quick attempt to skim through the code I found many things which appear to me to be incompatible with PEP 302. That is, the above code hardocdes a variety of assumptions about the import system that haven't been true since Python 2.3. (For example, it assumes that the contents of sys.path strings have inspectable semantics, that the contents of __file__ can tell you things about the module-ness or package-ness of a module object, etc.) If you want to fully support PEP 302, you might want to consider making this a wrapper over the corresponding pkgutil APIs (available since Python 2.5) that do roughly the same things, but which delegate all path string inspection to importer objects and allow extensible delegation for importers that don't support the optional methods involved. (Of course, if the pkgutil APIs are missing something you need, perhaps you could propose additions.) >Now it seems like pure virtual packages are going to introduce a new >type of special case into the hierarchy which have neither >.pathEntry nor .filePath objects. The problem is that your API's notion that these things exist as coherent concepts was never really a valid assumption in the first place. .pth files and namespace packages already meant that the idea of a package coming from a single path entry made no sense. And namespace packages installed by setuptools' system packaging mode *don't have a __file__ attribute* today... heck they don't have __init__ modules, either. So, adding virtual packages isn't actually going to change anything, except perhaps by making these scenarios more common. 
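For reference, a minimal sketch of what leaning on pkgutil instead of path-string inspection looks like (module discovery only; real code would also want the onerror hook and the optional importer methods mentioned above):

    import pkgutil

    # Top-level discovery: pkgutil asks the registered importers rather than
    # poking at sys.path strings directly.
    for finder, name, ispkg in pkgutil.iter_modules():
        print(name, 'pkg' if ispkg else 'mod', finder)

    # Walking below a package works the same way; note that this imports
    # subpackages as it descends.
    import email
    for finder, name, ispkg in pkgutil.walk_packages(email.__path__, 'email.'):
        print(name)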
From solipsis at pitrou.net Fri Aug 12 17:42:26 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 12 Aug 2011 17:42:26 +0200 Subject: [Python-Dev] GIL removal question References: <4E44294F.5060005@molden.no> Message-ID: <20110812174226.0cd068b1@pitrou.net> On Fri, 12 Aug 2011 09:32:23 -0500 VanL wrote: > On 8/11/2011 2:11 PM, Sturla Molden wrote: > > > > (b) another threading model (e.g. one interpreter per thread, as in Tcl, > > Erlang, or .NET app domains). > > We are close to this, in that we already have baked-in support for > subinterpreters. Out of curiosity, why isn't this being pursued? Because it is half-baked, breaks with some features in some extension modules, and still requires the GIL for shared data structures. Regards Antoine. From status at bugs.python.org Fri Aug 12 18:07:27 2011 From: status at bugs.python.org (Python tracker) Date: Fri, 12 Aug 2011 18:07:27 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20110812160727.404561CC0B@psf.upfronthosting.co.za> ACTIVITY SUMMARY (2011-08-05 - 2011-08-12) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue. Do NOT respond to this message. Issues counts and deltas: open 2923 (+24) closed 21602 (+23) total 24525 (+47) Open issues with patches: 1264 Issues opened (35) ================== #12032: Tools/Scripts/crlf.py needs updating for python 3+ http://bugs.python.org/issue12032 reopened by eric.araujo #12701: Apple's clang 2.1 (xcode 4.1, OSX 10.7) optimizer miscompiles http://bugs.python.org/issue12701 opened by deadshort #12702: shutil.copytree() should use os.lutimes() to copy the metadata http://bugs.python.org/issue12702 opened by petri.lehtinen #12703: Improve error reporting for packaging.util.resolve_name http://bugs.python.org/issue12703 opened by Natim #12704: Language References does not specify exception raised by final http://bugs.python.org/issue12704 opened by Nikratio #12705: Make compile('1\n2\n', '', 'single') raise an exception instea http://bugs.python.org/issue12705 opened by Devin Jeanpierre #12706: timeout sentinel in ftplib and poplib documentation http://bugs.python.org/issue12706 opened by orsenthil #12707: Deprecate addinfourl getters http://bugs.python.org/issue12707 opened by ezio.melotti #12708: multiprocessing.Pool is missing a starmap[_async]() method. 
http://bugs.python.org/issue12708 opened by hynek #12711: Explain tracker components in devguide http://bugs.python.org/issue12711 opened by eric.araujo #12712: weave build_tools library identification http://bugs.python.org/issue12712 opened by Tim.Holme #12713: argparse: allow abbreviation of sub commands by users http://bugs.python.org/issue12713 opened by pwil3058 #12716: Reorganize os docs for files/dirs/fds http://bugs.python.org/issue12716 opened by benjamin.peterson #12720: Expose linux extended filesystem attributes http://bugs.python.org/issue12720 opened by benjamin.peterson #12721: Chaotic use of helper functions in test_shutil for reading and http://bugs.python.org/issue12721 opened by hynek #12723: Provide an API in tkSimpleDialog for defining custom validatio http://bugs.python.org/issue12723 opened by rabbidous #12725: Docs: Odd phrase "floating seconds" in socket.html http://bugs.python.org/issue12725 opened by Cris.Simpson #12726: explain why locale.getlocale() does not read system's locales http://bugs.python.org/issue12726 opened by alexis #12728: Python re lib fails case insensitive matches on Unicode data http://bugs.python.org/issue12728 opened by tchrist #12729: Python lib re cannot handle Unicode properly due to narrow/wid http://bugs.python.org/issue12729 opened by tchrist #12730: Python's casemapping functions are untrustworthy due to narrow http://bugs.python.org/issue12730 opened by tchrist #12731: python lib re uses obsolete sense of \w in full violation of U http://bugs.python.org/issue12731 opened by tchrist #12732: Can't portably use Unicode in Python identifiers http://bugs.python.org/issue12732 opened by tchrist #12733: Request for grapheme support in Python re lib http://bugs.python.org/issue12733 opened by tchrist #12734: Request for property support in Python re lib http://bugs.python.org/issue12734 opened by tchrist #12735: request full Unicode collation support in std python library http://bugs.python.org/issue12735 opened by tchrist #12737: string.title() is overzealous by upcasing combining marks ina http://bugs.python.org/issue12737 opened by tchrist #12738: Bug in multiprocessing.JoinableQueue() implementation on Ubunt http://bugs.python.org/issue12738 opened by Michael.Hall #12739: read stuck with multithreading and simultaneous subprocess.Pop http://bugs.python.org/issue12739 opened by SAPikachu #12740: Add struct.Struct.nmemb http://bugs.python.org/issue12740 opened by skrah #12741: Implementation of shutil.move http://bugs.python.org/issue12741 opened by David.Townshend #12742: Add support for CESU-8 encoding http://bugs.python.org/issue12742 opened by adalx #12743: C API marshalling doc contains XXX http://bugs.python.org/issue12743 opened by JJeffries #12715: Add symlink support to shutil functions http://bugs.python.org/issue12715 opened by petri.lehtinen #12736: Request for python casemapping functions to use full not simpl http://bugs.python.org/issue12736 opened by tchrist Most recent 15 issues with no replies (15) ========================================== #12743: C API marshalling doc contains XXX http://bugs.python.org/issue12743 #12742: Add support for CESU-8 encoding http://bugs.python.org/issue12742 #12741: Implementation of shutil.move http://bugs.python.org/issue12741 #12740: Add struct.Struct.nmemb http://bugs.python.org/issue12740 #12739: read stuck with multithreading and simultaneous subprocess.Pop http://bugs.python.org/issue12739 #12737: string.title() is overzealous by upcasing combining marks ina 
http://bugs.python.org/issue12737 #12736: Request for python casemapping functions to use full not simpl http://bugs.python.org/issue12736 #12735: request full Unicode collation support in std python library http://bugs.python.org/issue12735 #12733: Request for grapheme support in Python re lib http://bugs.python.org/issue12733 #12732: Can't portably use Unicode in Python identifiers http://bugs.python.org/issue12732 #12731: python lib re uses obsolete sense of \w in full violation of U http://bugs.python.org/issue12731 #12730: Python's casemapping functions are untrustworthy due to narrow http://bugs.python.org/issue12730 #12728: Python re lib fails case insensitive matches on Unicode data http://bugs.python.org/issue12728 #12726: explain why locale.getlocale() does not read system's locales http://bugs.python.org/issue12726 #12725: Docs: Odd phrase "floating seconds" in socket.html http://bugs.python.org/issue12725 Most recent 15 issues waiting for review (15) ============================================= #12740: Add struct.Struct.nmemb http://bugs.python.org/issue12740 #12723: Provide an API in tkSimpleDialog for defining custom validatio http://bugs.python.org/issue12723 #12721: Chaotic use of helper functions in test_shutil for reading and http://bugs.python.org/issue12721 #12720: Expose linux extended filesystem attributes http://bugs.python.org/issue12720 #12711: Explain tracker components in devguide http://bugs.python.org/issue12711 #12708: multiprocessing.Pool is missing a starmap[_async]() method. http://bugs.python.org/issue12708 #12691: tokenize.untokenize is broken http://bugs.python.org/issue12691 #12684: profile does not dump stats on exception like cProfile does http://bugs.python.org/issue12684 #12668: 3.2 What's New: it's integer->string, not the opposite http://bugs.python.org/issue12668 #12666: map semantic change not documented in What's New http://bugs.python.org/issue12666 #12656: test.test_asyncore: add tests for AF_INET6 and AF_UNIX sockets http://bugs.python.org/issue12656 #12652: Keep test.support docs out of the global docs index http://bugs.python.org/issue12652 #12650: Subprocess leaks fd upon kill() http://bugs.python.org/issue12650 #12646: zlib.Decompress.decompress/flush do not raise any exceptions w http://bugs.python.org/issue12646 #12639: msilib Directory.start_component() fails if keyfile is not Non http://bugs.python.org/issue12639 Top 10 most discussed issues (10) ================================= #12682: Meaning of 'accepted' resolution as documented in devguide http://bugs.python.org/issue12682 10 msgs #12301: Use :data:`sys.thing` instead of ``sys.thing`` throughout http://bugs.python.org/issue12301 9 msgs #12191: Add shutil.chown to allow to use user and group name (and not http://bugs.python.org/issue12191 8 msgs #12666: map semantic change not documented in What's New http://bugs.python.org/issue12666 7 msgs #12672: Some problems in documentation extending/newtypes.html http://bugs.python.org/issue12672 7 msgs #2857: Add "java modified utf-8" codec http://bugs.python.org/issue2857 6 msgs #12721: Chaotic use of helper functions in test_shutil for reading and http://bugs.python.org/issue12721 6 msgs #12541: Accepting Badly formed headers in urllib HTTPBasicAuth http://bugs.python.org/issue12541 5 msgs #12701: Apple's clang 2.1 (xcode 4.1, OSX 10.7) optimizer miscompiles http://bugs.python.org/issue12701 5 msgs #12715: Add symlink support to shutil functions http://bugs.python.org/issue12715 5 msgs Issues closed (24) ================== #10087: 
HTML calendar is broken http://bugs.python.org/issue10087 closed by python-dev #10741: PyGILState_GetThisThreadState() lacks a doc entry http://bugs.python.org/issue10741 closed by sandro.tosi #12437: _ctypes.dlopen does not include errno in OSError http://bugs.python.org/issue12437 closed by pitrou #12575: add a AST validator http://bugs.python.org/issue12575 closed by python-dev #12608: crash in PyAST_Compile when running Python code http://bugs.python.org/issue12608 closed by meador.inge #12661: Add a new shutil.cleartree function to shutil module http://bugs.python.org/issue12661 closed by chin #12662: Add support for duplicate options in configparser http://bugs.python.org/issue12662 closed by lukasz.langa #12677: Turtle, fix right/left rotation orientation http://bugs.python.org/issue12677 closed by sandro.tosi #12687: Python 3.2 fails to load protocol 0 pickle http://bugs.python.org/issue12687 closed by pitrou #12694: crlf.py script from Tools doesn't work with Python 3.2 http://bugs.python.org/issue12694 closed by r.david.murray #12697: timeit documention still refers to the timeit.py script http://bugs.python.org/issue12697 closed by python-dev #12698: urllib does not use no_proxy when it has blanks http://bugs.python.org/issue12698 closed by python-dev #12699: strange behaviour of locale.getlocale() -> None, None http://bugs.python.org/issue12699 closed by ned.deily #12700: test_faulthandler fails on Mac OS X Lion http://bugs.python.org/issue12700 closed by haypo #12709: In multiprocessing, error_callback isn't documented for map_as http://bugs.python.org/issue12709 closed by sandro.tosi #12710: GTK crash http://bugs.python.org/issue12710 closed by sandro.tosi #12714: argparse.ArgumentParser.add_argument() documentation error http://bugs.python.org/issue12714 closed by r.david.murray #12717: ConfigParser._Chainmap error in 2.7.2 http://bugs.python.org/issue12717 closed by rhettinger #12718: Logical mistake of importer method in logging.config.BaseConfi http://bugs.python.org/issue12718 closed by vinay.sajip #12719: Direct access to tp_dict can lead to stale attributes http://bugs.python.org/issue12719 closed by python-dev #12722: Link to heapq source from docs.python.org not working http://bugs.python.org/issue12722 closed by python-dev #12724: Add Py_RETURN_NOTIMPLEMENTED http://bugs.python.org/issue12724 closed by brian.curtin #12727: "make" always reruns asdl_c.py http://bugs.python.org/issue12727 closed by python-dev #11047: Bad description for an entry in whatsnew/2.7 http://bugs.python.org/issue11047 closed by python-dev From barry at python.org Fri Aug 12 18:19:23 2011 From: barry at python.org (Barry Warsaw) Date: Fri, 12 Aug 2011 12:19:23 -0400 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: References: <4E43E8D0.40201@netwok.org> <20110811150022.02A192505A7@webabinitio.net> <4E43F156.8040008@netwok.org> Message-ID: <20110812121923.16216dd1@resist.wooz.org> On Aug 12, 2011, at 01:10 PM, Nick Coghlan wrote: >1. Accept the reality of that situation, and propose a mechanism that >minimises the impact of the resulting ambiguity on end users of Python >by allowing developers to be explicit about their target language. >This is the approach advocated in PEP 394. > >2. Tell the Arch developers (and anyone else inclined to point the >python name at python3) that they're wrong, and the python symlink >should, now and forever, always refer to a version of Python 2.x. 
FWIW, although I generally support the PEP, I also think that distros themselves have a responsibility to ensure their #! lines are correct, for scripts they install. Meaning, if it requires rewriting the #! line on OS package install, so be it. -Barry From catch-all at masklinn.net Fri Aug 12 18:51:10 2011 From: catch-all at masklinn.net (Xavier Morel) Date: Fri, 12 Aug 2011 18:51:10 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: <4E44294F.5060005@molden.no> References: <4E44294F.5060005@molden.no> Message-ID: <4E01DE69-C873-48AE-B810-F1E467CBF792@masklinn.net> On 2011-08-11, at 21:11 , Sturla Molden wrote: > > (b) another threading model (e.g. one interpreter per thread, as in Tcl, Erlang, or .NET app domains). Nitpick: this is not correct re. erlang. While it is correct that it uses "another threading model" (one could even say "no threading model"), it's not a "one interpreter per thread" model at all: * Erlang uses "erlang processes", which are very cheap preempted *processes* (no shared memory). There have always been tens to thousands to millions of erlang processes per interpreter * A long time ago (before 2006 and the SMP VM, that was R11B) the erlang VM was single-threaded, so all those erlang processes ran in a single OS thread. To use multiple OS threads one had to create an erlang cluster (start multiple VMs and distribute spawned processes over those). However, this was already an m:n model, there were multiple erlang processes for each VM. * Since the introduction of the SMP VM, the erlang interpreter can create multiple *schedulers* (one per physical core by default), with each scheduler running in its own OS thread. In this model, there's a single interpreter and an m:n mapping of erlang processes to OS threads within that single interpreter. (interestingly, because -smp generates resource contention within the interpreter going back to pre-SMP by setting the number of schedulers per node to 1 can yield increased overall performances) From merwok at netwok.org Fri Aug 12 18:51:04 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Fri, 12 Aug 2011 18:51:04 +0200 Subject: [Python-Dev] Backporting howto/pyporting to 2.7 Message-ID: <4E4559F8.7040507@netwok.org> Hi everyone, I think it would be useful to have the “Porting Python 2 Code to Python 3” HOWTO in the 2.7 docs, as I think that a lot of users consult the 2.7 docs. Is there any reason not to do it? Regards From rene at stranden.com Fri Aug 12 18:57:10 2011 From: rene at stranden.com (Rene Nejsum) Date: Fri, 12 Aug 2011 18:57:10 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: <20110812174226.0cd068b1@pitrou.net> References: <4E44294F.5060005@molden.no> <20110812174226.0cd068b1@pitrou.net> Message-ID: <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> My two danish kroner on GIL issues... I think I understand the background and need for the GIL. Without it Python programs would have been cluttered with lock/synchronized statements and C-extensions would be harder to write. Thanks to Sturla Molden for his explanation earlier in this thread. However, the GIL is also from a time when single-threaded programs running on single-core CPUs were the common case. On a new MacBook Pro I have 8 cores and would expect my multithreaded Python program to run significantly faster than on a one-core CPU. Instead the program slows down to a much worse performance than on a one-core CPU.
(Have a look at David Beazley's excellent talk on PyCon 2010 and his paper http://www.dabeaz.com/GIL/ and http://blip.tv/carlfk/mindblowing-python-gil-2243379) From my viewpoint the multicore performance problem is the primary problem with the GIL, even though the other issues pointed out are valid. I still believe that the solution for Python would be to have an "every object is a thread/coroutine" solution à la - ABCL (http://en.wikipedia.org/wiki/Actor-Based_Concurrent_Language) and - COOC (Concurrent Object Oriented C, ftp://tsbgw.isl.rdc.toshiba.co.jp/pub/toshiba/cooc-beta.1.1.tar.Z) at least looked into as an alternative to an STM solution. But, my head is not big enough to fully understand this :-) kind regards /rene -------------- next part -------------- An HTML attachment was scrubbed... URL: From glyph at twistedmatrix.com Fri Aug 12 19:09:25 2011 From: glyph at twistedmatrix.com (Glyph Lefkowitz) Date: Fri, 12 Aug 2011 13:09:25 -0400 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning In-Reply-To: <20110812152512.112A53A406B@sparrow.telecommunity.com> References: <4E43E9A6.7020608@netwok.org> <20110811113952.2e257351@resist.wooz.org> <41282FF3-4DAF-4996-B745-E0BEA477FB01@twistedmatrix.com> <20110812152512.112A53A406B@sparrow.telecommunity.com> Message-ID: <0DA48AAD-78EE-496E-BF20-023B7A0868FD@twistedmatrix.com> On Aug 12, 2011, at 11:24 AM, P.J. Eby wrote: > That is, the above code hardocdes a variety of assumptions about the import system that haven't been true since Python 2.3. Thanks for this feedback. I honestly did not realize how old and creaky this code had gotten. It was originally developed for Python 2.4 and it certainly shows its age. Practically speaking, the code is correct for the bundled importers, and paths and zipfiles are all we've cared about thus far. > (For example, it assumes that the contents of sys.path strings have inspectable semantics, that the contents of __file__ can tell you things about the module-ness or package-ness of a module object, etc.) Unfortunately, the primary goal of this code is to do something impossible - walk the module hierarchy without importing any code. So some heuristics are necessary. Upon further reflection, PEP 402 _will_ make dealing with namespace packages from this code considerably easier: we won't need to do AST analysis to look for a __path__ attribute or anything gross like that to improve correctness; we can just look in various directories on sys.path and accurately predict what __path__ will be synthesized to be. However, the isPackage() method can and should be looking at the module if it's already loaded, and not always guessing based on paths. The whole reason there's an 'importPackages' flag to walk() is that some applications of this code care more about accuracy than others, so it tries to be as correct as it can be. (Of course this is still wrong for the case where a __path__ is dynamically constructed by user code, but there's only so well one can do at that.) > If you want to fully support PEP 302, you might want to consider making this a wrapper over the corresponding pkgutil APIs (available since Python 2.5) that do roughly the same things, but which delegate all path string inspection to importer objects and allow extensible delegation for importers that don't support the optional methods involved. This code still needs to support Python 2.4, but I will make a note of this for future reference.
> (Of course, if the pkgutil APIs are missing something you need, perhaps you could propose additions.) >> Now it seems like pure virtual packages are going to introduce a new type of special case into the hierarchy which have neither .pathEntry nor .filePath objects. > > The problem is that your API's notion that these things exist as coherent concepts was never really a valid assumption in the first place. .pth files and namespace packages already meant that the idea of a package coming from a single path entry made no sense. And namespace packages installed by setuptools' system packaging mode *don't have a __file__ attribute* today... heck they don't have __init__ modules, either. The fact that getModule('sys') breaks is reason enough to re-visit some of these design decisions. > So, adding virtual packages isn't actually going to change anything, except perhaps by making these scenarios more common. In that case, I guess it's a good thing; these bugs should be dealt with. Thanks for pointing them out. My opinion of PEP 402 has been completely reversed - although I'd still like to see a section about the module system from a library/tools author point of view rather than a time-traveling perl user's narrative :). -------------- next part -------------- An HTML attachment was scrubbed... URL: From merwok at netwok.org Fri Aug 12 19:17:20 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Fri, 12 Aug 2011 19:17:20 +0200 Subject: [Python-Dev] [Python-checkins] cpython (3.2): Use real word in English text (i.e. not code) In-Reply-To: <4E455C8A.4030104@udel.edu> References: <4E455C8A.4030104@udel.edu> Message-ID: <4E456020.2020904@netwok.org> Hi, >> summary: >> Use real word in English text (i.e. not code) > I agree that 'arg' for 'argument is email/twitter-speak, not proper > document prose. >> - :synopsis: Command-line option and argument-parsing library. >> + :synopsis: Command-line option and argument parsing library. > However, 'argument-parsing' could/should be left hyphenated as a > compound adjective for the same reason 'command-line' is. With all due respect to the fact that you?re a native speaker and I?m not, here I disagree because I parse the sentence in this way (using parens to group things by precedence, if you want): (((command-line (option and argument)) parsing) library) To paraphrase, it?s a library to parse options and arguments from the command line, not a library to parse arguments and (missing verb-ing) options from the command line. (I?m not sure I?m clear.) > An arg you missed Yes, I looked for all instances of args but not arg. Will do. Regards From rdmurray at bitdance.com Fri Aug 12 19:34:49 2011 From: rdmurray at bitdance.com (R. David Murray) Date: Fri, 12 Aug 2011 13:34:49 -0400 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: <20110812121923.16216dd1@resist.wooz.org> References: <4E43E8D0.40201@netwok.org> <20110811150022.02A192505A7@webabinitio.net> <4E43F156.8040008@netwok.org> <20110812121923.16216dd1@resist.wooz.org> Message-ID: <20110812173449.8C9E22505A7@webabinitio.net> On Fri, 12 Aug 2011 12:19:23 -0400, Barry Warsaw wrote: > On Aug 12, 2011, at 01:10 PM, Nick Coghlan wrote: > >1. Accept the reality of that situation, and propose a mechanism that > >minimises the impact of the resulting ambiguity on end users of Python > >by allowing developers to be explicit about their target language. > >This is the approach advocated in PEP 394. > > > >2. 
Tell the Arch developers (and anyone else inclined to point the > >python name at python3) that they're wrong, and the python symlink > >should, now and forever, always refer to a version of Python 2.x. > > FWIW, although I generally support the PEP, I also think that distros > themselves have a responsibility to ensure their #! lines are correct, for > scripts they install. Meaning, if it requires rewriting the #! line on OS > package install, so be it. True, but I think that is orthogonal to the purposes of the PEP, which is about supporting writing of system independent scripts that are *not* provided by the distribution (or installed via packaging). And PEP 397 aims to extend that to Windows, as well. -- R. David Murray http://www.bitdance.com From barry at python.org Fri Aug 12 19:37:14 2011 From: barry at python.org (Barry Warsaw) Date: Fri, 12 Aug 2011 13:37:14 -0400 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: <20110812173449.8C9E22505A7@webabinitio.net> References: <4E43E8D0.40201@netwok.org> <20110811150022.02A192505A7@webabinitio.net> <4E43F156.8040008@netwok.org> <20110812121923.16216dd1@resist.wooz.org> <20110812173449.8C9E22505A7@webabinitio.net> Message-ID: <20110812133714.4a56ee98@resist.wooz.org> On Aug 12, 2011, at 01:34 PM, R. David Murray wrote: >True, but I think that is orthogonal to the purposes of the PEP, which >is about supporting writing of system independent scripts that are *not* >provided by the distribution (or installed via packaging). And PEP 397 >aims to extend that to Windows, as well. Yep, agreed. It probably should also inform #! transformations that pysetup could do. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From pje at telecommunity.com Fri Aug 12 20:33:47 2011 From: pje at telecommunity.com (P.J. Eby) Date: Fri, 12 Aug 2011 14:33:47 -0400 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning In-Reply-To: <0DA48AAD-78EE-496E-BF20-023B7A0868FD@twistedmatrix.com> References: <4E43E9A6.7020608@netwok.org> <20110811113952.2e257351@resist.wooz.org> <41282FF3-4DAF-4996-B745-E0BEA477FB01@twistedmatrix.com> <20110812152512.112A53A406B@sparrow.telecommunity.com> <0DA48AAD-78EE-496E-BF20-023B7A0868FD@twistedmatrix.com> Message-ID: <20110812183406.7060D3A406B@sparrow.telecommunity.com> At 01:09 PM 8/12/2011 -0400, Glyph Lefkowitz wrote: >Upon further reflection, PEP 402 _will_ make dealing with namespace >packages from this code considerably easier: we won't need to do AST >analysis to look for a __path__ attribute or anything gross like >that improve correctness; we can just look in various directories on >sys.path and accurately predict what __path__ will be synthesized to be. The flip side of that is that you can't always know whether a directory is a virtual package without deep inspection: one consequence of PEP 402 is that any directory that contains a Python module (of whatever type), however deeply nested, will be a valid package name. So, you can't rule out that a given directory *might* be a package, without walking its entire reachable subtree. (Within the subset of directory names that are valid Python identifiers, of course.) 
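A filesystem-only sketch of what that worst case amounts to (this deliberately ignores zip importers and any other PEP 302 backends, which is exactly the sort of assumption the PEP lets importer objects override):

    import os

    def might_become_virtual_package(dirname):
        # Under PEP 402 a directory could end up importable as a virtual
        # package if *any* module lives somewhere beneath it, so ruling it
        # out means walking the entire subtree.
        for root, dirs, files in os.walk(dirname):
            if any(f.endswith(('.py', '.pyc', '.pyo')) for f in files):
                return True
        return False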
However, you *can* quickly tell that a directory *might* be a package or is *probably* one: if it contains modules, or is the same name as an already-discovered module, it's a pretty safe bet that you can flag it as such. In any case, you probably should *not* do the building of a virtual path yourself; the protocols and APIs added by PEP 402 should allow you to simply ask for the path to be constructed on your behalf. Otherwise, you are going to be back in the same business of second-guessing arbitrary importer backends again! (E.g. note that PEP 402 does not say virtual package subpaths must be filesystem or zipfile subdirectories of their parents - an importer could just as easily allow you to treat subdirectories named 'twisted.python' as part of a virtual package with that name!) Anyway, pkgutil defines some extra methods that importers can implement to support module-walking, and part of the PEP 402 implementation should be to make this support virtual packages as well. >This code still needs to support Python 2.4, but I will make a note >of this for future reference. A suggestion: just take the pkgutil code and bundle it for Python 2.4 as something._pkgutil. There's very little about it that's 2.5+ specific, at least when I wrote the bits that do the module walking. Of course, the main disadvantage of pkgutil for your purposes is that it currently requires packages to be imported in order to walk their child modules. (IIRC, it does *not*, however, require them to be imported in order to discover their existence.) >In that case, I guess it's a good thing; these bugs should be dealt >with. Thanks for pointing them out. My opinion of PEP 402 has been >completely reversed - although I'd still like to see a section about >the module system from a library/tools author point of view rather >than a time-traveling perl user's narrative :). LOL. If you will propose the wording you'd like to see, I'll be happy to check it for any current-and-or-future incorrect assumptions. ;-) From fdrake at acm.org Fri Aug 12 20:42:24 2011 From: fdrake at acm.org (Fred Drake) Date: Fri, 12 Aug 2011 14:42:24 -0400 Subject: [Python-Dev] [Python-checkins] cpython (3.2): Use real word in English text (i.e. not code) In-Reply-To: <4E456020.2020904@netwok.org> References: <4E455C8A.4030104@udel.edu> <4E456020.2020904@netwok.org> Message-ID: I think either Command-line option- and argument-parsing library. or Command-line option and argument parsing library. would be acceptable. -Fred -- Fred L. Drake, Jr.? ? "A person who won't read has no advantage over one who can't read." ?? --Samuel Langhorne Clemens From iacobcatalin at gmail.com Fri Aug 12 20:56:19 2011 From: iacobcatalin at gmail.com (Catalin Iacob) Date: Fri, 12 Aug 2011 20:56:19 +0200 Subject: [Python-Dev] Review request issue 12178 Message-ID: Could a core developer please review the patch I proposed for issue 12178 "csv writer doesn't escape escapechar"? Thanks! From sturla at molden.no Fri Aug 12 20:59:42 2011 From: sturla at molden.no (Sturla Molden) Date: Fri, 12 Aug 2011 20:59:42 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: <4E01DE69-C873-48AE-B810-F1E467CBF792@masklinn.net> References: <4E44294F.5060005@molden.no> <4E01DE69-C873-48AE-B810-F1E467CBF792@masklinn.net> Message-ID: <4E45781E.2040608@molden.no> Den 12.08.2011 18:51, skrev Xavier Morel: > * Erlang uses "erlang processes", which are very cheap preempted > *processes* (no shared memory). 
There have always been tens to > thousands to millions of erlang processes per interpreter source > contention within the interpreter going back to pre-SMP by setting the > number of schedulers per node to 1 can yield increased overall > performances) Technically, one can make threads behave like processes if they don't share memory pages (though they will still share address space). Erlangs use of 'process' instead of 'thread' does not mean an Erlang process has to be implemented as an OS process. With one interpreter per thread, and a malloc that does not let threads share memory pages (one heap per thread), Python could do the same. On Windows, there is an API function called HeapAlloc, which lets us allocate memory form a dedicated heap. The common use case is to prevent threads from sharing memory, thus behaving like light-weight processes (except address space is shared). On Unix, is is more common to use fork() to create new processes instead, as processes are more light-weight than on Windows. Sturla From sturla at molden.no Fri Aug 12 20:36:37 2011 From: sturla at molden.no (Sturla Molden) Date: Fri, 12 Aug 2011 20:36:37 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> References: <4E44294F.5060005@molden.no> <20110812174226.0cd068b1@pitrou.net> <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> Message-ID: <4E4572B5.4070109@molden.no> Den 12.08.2011 18:57, skrev Rene Nejsum: > My two danish kroner on GIL issues?. > > I think I understand the background and need for GIL. Without it > Python programs would have been cluttered with lock/synchronized > statements and C-extensions would be harder to write. Thanks to Sturla > Molden for he's explanation earlier in this thread. I doesn't seem I managed to explain it :( Yes, C extensions would be cluttered with synchronization statements, and that is annoying. But that was not my point all! Even with fine-grained locking in place, a system using reference counting will not scale on an multi-processor computer. Cache-lines containing reference counts will become incoherent between the processors, causing traffic jam on the memory bus. The technical term in parallel computing litterature is "false sharing". > However, the GIL is also from a time, where single threaded programs > running in single core CPU's was the common case. > > On a new MacBook Pro I have 8 core's and would expect my multithreaded > Python program to run significantly fast than on a one-core CPU. > > Instead the program slows down to a much worse performance than on a > one-core CPU. A multi-threaded program can be slower on a multi-processor computer as well, if it suffered from extensive "false sharing" (which Python programs nearly always will do). That is, instead of doing useful work, the processors are stepping on each others toes. So they spend the bulk of the time synchronizing cache lines with RAM instead of computing. On a computer with a single processor, there cannot be any false sharing. So even without a GIL, a multi-threaded program can often run faster on a single-processor computer. That might seem counter-intuitive at first. I seen this "inversed scaling" blamed on the GIL many times, but it's dead wrong. Multi-threading is hard to get right, because the programmer must ensure that processors don't access the same cache lines. 
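To make the effect concrete, here is a minimal sketch in plain C (illustrative only, nothing Python-specific): two threads each update *their own* counter, but because the counters sit next to each other they share a cache line, so every increment forces the other core's copy of that line to be invalidated. Padding the counters apart so each gets its own cache line makes the contention disappear.

#include <pthread.h>
#include <stdio.h>

/* Both counters live in the same cache line -> "false sharing". */
static struct { volatile long a; volatile long b; } counters;

static void *bump_a(void *unused) {
    for (long i = 0; i < 100000000L; i++)
        counters.a++;        /* keeps invalidating the line held by the other core */
    return NULL;
}

static void *bump_b(void *unused) {
    for (long i = 0; i < 100000000L; i++)
        counters.b++;        /* ...and vice versa */
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, bump_a, NULL);
    pthread_create(&tb, NULL, bump_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("%ld %ld\n", counters.a, counters.b);
    return 0;                /* compare wall-clock time with the counters padded apart */
}

Reference counts scattered through objects that several threads touch behave much like those two counters, which is why the problem is independent of how fine-grained the locking is.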
This is one of the reasons why numerical programs based on MPI (multiple processes and IPC) are likely to perform better than numerical programs based on OpenMP (multiple threads and shared memory). As for Python, it means that it is easier to make a program based on multiprocessing scale well on a multi-processor computer, than a program based on threading and releasing the GIL. And that has nothing to do with the GIL! Albeit, I'd estimate 99% of Python programmers would blame it on the GIL. It has to do with what shared memory does if cache lines are shared. Intuition about what affects the performance of a multi-threaded program is very often wrong. If one needs parallel computing, multiple processes is much more likely to scale correctly. Threads are better reserved for things like non-blocking I/O. The problem with the GIL is merely what people think it does -- not what it actually does. It is so easy to blame a performance issue on the GIL, when it is actually the use of threads and shared memory per se that is the problem. Sturla From aaron at agoragames.com Fri Aug 12 21:17:59 2011 From: aaron at agoragames.com (Aaron Westendorf) Date: Fri, 12 Aug 2011 15:17:59 -0400 Subject: [Python-Dev] GIL removal question In-Reply-To: <4E45781E.2040608@molden.no> References: <4E44294F.5060005@molden.no> <4E01DE69-C873-48AE-B810-F1E467CBF792@masklinn.net> <4E45781E.2040608@molden.no> Message-ID: Even in the Erlang model, the afore-mentioned issues of bus contention put a cap on the number of threads you can run in any given application assuming there's any amount of cross-thread synchronization. I wrote a blog post on this subject with respect to my experience in tuning RabbitMQ on NUMA architectures. http://blog.agoragames.com/blog/2011/06/24/of-penguins-rabbits-and-buses/ It should be noted that Erlang processes are not the same as OS processes. They are more akin to green threads, scheduled on N number of legit OS threads which are in turn run on C number of cores. The end effect is the same though, as the data is effectively shared across NUMA nodes, which runs into basic physical constraints. I used to think the GIL was a major bottleneck, and though I'm not fond of it, my recent experience has highlighted that *any* application which uses shared memory will have significant bus contention when scaling across all cores. The best course of action is shared-nothing MPI style, but in 64bit land, that can mean significant wasted address space. -Aaron On Fri, Aug 12, 2011 at 2:59 PM, Sturla Molden wrote: > Den 12.08.2011 18:51, skrev Xavier Morel: > >> * Erlang uses "erlang processes", which are very cheap preempted >> *processes* (no shared memory). There have always been tens to thousands to >> millions of erlang processes per interpreter source contention within the >> interpreter going back to pre-SMP by setting the number of schedulers per >> node to 1 can yield increased overall performances) >> > > Technically, one can make threads behave like processes if they don't share > memory pages (though they will still share address space). Erlangs use of > 'process' instead of 'thread' does not mean an Erlang process has to be > implemented as an OS process. With one interpreter per thread, and a malloc > that does not let threads share memory pages (one heap per thread), Python > could do the same. > > On Windows, there is an API function called HeapAlloc, which lets us > allocate memory form a dedicated heap. 
The common use case is to prevent > threads from sharing memory, thus behaving like light-weight processes > (except address space is shared). On Unix, is is more common to use fork() > to create new processes instead, as processes are more light-weight than on > Windows. > > Sturla > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.badger at gmail.com Fri Aug 12 21:22:17 2011 From: a.badger at gmail.com (Toshio Kuratomi) Date: Fri, 12 Aug 2011 12:22:17 -0700 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: <20110812121923.16216dd1@resist.wooz.org> References: <4E43E8D0.40201@netwok.org> <20110811150022.02A192505A7@webabinitio.net> <4E43F156.8040008@netwok.org> <20110812121923.16216dd1@resist.wooz.org> Message-ID: <20110812192217.GR5771@unaka.lan> On Fri, Aug 12, 2011 at 12:19:23PM -0400, Barry Warsaw wrote: > On Aug 12, 2011, at 01:10 PM, Nick Coghlan wrote: > > >1. Accept the reality of that situation, and propose a mechanism that > >minimises the impact of the resulting ambiguity on end users of Python > >by allowing developers to be explicit about their target language. > >This is the approach advocated in PEP 394. > > > >2. Tell the Arch developers (and anyone else inclined to point the > >python name at python3) that they're wrong, and the python symlink > >should, now and forever, always refer to a version of Python 2.x. > > FWIW, although I generally support the PEP, I also think that distros > themselves have a responsibility to ensure their #! lines are correct, for > scripts they install. Meaning, if it requires rewriting the #! line on OS > package install, so be it. > +1 with the one caveat... it's nice to upstream fixes. If there's a simple thing like python == python-2 and python3 == python-3 everywhere, this is possible. If there's something like python2 == python-2 and python-3 == python3 everywhere, this is also possible. The problem is that: the latter is not the case (python from python.org itself doesn't produce a python2 symlink on install) and historically the former was the case but since python-dev rejected the notion that python == python-2 that is no long true. As long as it's just Arch, there's still time to go with #2. #1 is not a complete solution (especially because /usr/bin/python2 will never exist on some historical systems [not ones I run though, so someone else will need to beat that horse :-)]) but is better than where we are now where there is no guidance on what's right and wrong at all. -Toshio -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From catch-all at masklinn.net Fri Aug 12 22:17:22 2011 From: catch-all at masklinn.net (Xavier Morel) Date: Fri, 12 Aug 2011 22:17:22 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: <4E45781E.2040608@molden.no> References: <4E44294F.5060005@molden.no> <4E01DE69-C873-48AE-B810-F1E467CBF792@masklinn.net> <4E45781E.2040608@molden.no> Message-ID: On 2011-08-12, at 20:59 , Sturla Molden wrote: > Den 12.08.2011 18:51, skrev Xavier Morel: >> * Erlang uses "erlang processes", which are very cheap preempted *processes* (no shared memory). 
There have always been tens to thousands to millions of erlang processes per interpreter source contention within the interpreter going back to pre-SMP by setting the number of schedulers per node to 1 can yield increased overall performances) > > Technically, one can make threads behave like processes if they don't share memory pages (though they will still share address space). Erlangs use of 'process' instead of 'thread' does not mean an Erlang process has to be implemented as an OS process. Of course not. I did not write anything implying that. > With one interpreter per thread, and a malloc that does not let threads share memory pages (one heap per thread), Python could do the same. Again, my point is that Erlang does not work "with one interpreter per thread". Which was your claim. From glyph at twistedmatrix.com Fri Aug 12 23:03:47 2011 From: glyph at twistedmatrix.com (Glyph Lefkowitz) Date: Fri, 12 Aug 2011 17:03:47 -0400 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning In-Reply-To: <20110812183406.7060D3A406B@sparrow.telecommunity.com> References: <4E43E9A6.7020608@netwok.org> <20110811113952.2e257351@resist.wooz.org> <41282FF3-4DAF-4996-B745-E0BEA477FB01@twistedmatrix.com> <20110812152512.112A53A406B@sparrow.telecommunity.com> <0DA48AAD-78EE-496E-BF20-023B7A0868FD@twistedmatrix.com> <20110812183406.7060D3A406B@sparrow.telecommunity.com> Message-ID: <08D6CFE0-3039-4DAF-944F-0F0DEFFFD87A@twistedmatrix.com> On Aug 12, 2011, at 2:33 PM, P.J. Eby wrote: > At 01:09 PM 8/12/2011 -0400, Glyph Lefkowitz wrote: >> Upon further reflection, PEP 402 _will_ make dealing with namespace packages from this code considerably easier: we won't need to do AST analysis to look for a __path__ attribute or anything gross like that improve correctness; we can just look in various directories on sys.path and accurately predict what __path__ will be synthesized to be. > > The flip side of that is that you can't always know whether a directory is a virtual package without deep inspection: one consequence of PEP 402 is that any directory that contains a Python module (of whatever type), however deeply nested, will be a valid package name. So, you can't rule out that a given directory *might* be a package, without walking its entire reachable subtree. (Within the subset of directory names that are valid Python identifiers, of course.) Are there any rules about passing invalid identifiers to __import__ though, or is that just less likely? :) > However, you *can* quickly tell that a directory *might* be a package or is *probably* one: if it contains modules, or is the same name as an already-discovered module, it's a pretty safe bet that you can flag it as such. I still like the idea of a 'marker' file. It would be great if there were a new marker like "__package__.py". I say this more for the benefit of users looking at a directory on their filesystem and trying to understand whether this is a package or not than I do for my own programmatic tools though; it's already hard enough to understand the package-ness of a part of your filesystem and its interactions with PYTHONPATH; making directories mysteriously and automatically become packages depending on context will worsen that situation, I think. I also have this not-terribly-well-defined idea that it would be handy for different providers of the _contents_ of namespace packages to provide their own instrumentation to be made aware that they've been added to the __path__ of a particular package. 
This may be a solution in search of a problem, but I imagine that each __package__.py would be executed in the same module namespace. This would allow namespace packages to do things like set up compatibility aliases, lazy imports, plugin registrations, etc, as they currently do with __init__.py. Perhaps it would be better to define its relationship to the package-module namespace in a more sensible way than "execute all over each other in no particular order". Also, if I had my druthers, Python would raise an exception if someone added a directory marked as a package to sys.path, to refuse to import things from it, and when a submodule was run as a script, add the nearest directory not marked as a package to sys.path, rather than the script's directory itself. The whole "__name__ is wrong because your current directory was wrong when you ran that command" thing is so confusing to explain that I hope we can eventually consign it to the dustbin of history. But if you can't even reasonably guess whether a directory is supposed to be an entry on sys.path or a package, that's going to be really hard to do. > In any case, you probably should *not* do the building of a virtual path yourself; the protocols and APIs added by PEP 402 should allow you to simply ask for the path to be constructed on your behalf. Otherwise, you are going to be back in the same business of second-guessing arbitrary importer backends again! What do you mean "building of a virtual path"? > (E.g. note that PEP 402 does not say virtual package subpaths must be filesystem or zipfile subdirectories of their parents - an importer could just as easily allow you to treat subdirectories named 'twisted.python' as part of a virtual package with that name!) > > Anyway, pkgutil defines some extra methods that importers can implement to support module-walking, and part of the PEP 402 implementation should be to make this support virtual packages as well. The more that this can focus on module-walking without executing code, the happier I'll be :). >> This code still needs to support Python 2.4, but I will make a note of this for future reference. > > A suggestion: just take the pkgutil code and bundle it for Python 2.4 as something._pkgutil. There's very little about it that's 2.5+ specific, at least when I wrote the bits that do the module walking. > > Of course, the main disadvantage of pkgutil for your purposes is that it currently requires packages to be imported in order to walk their child modules. (IIRC, it does *not*, however, require them to be imported in order to discover their existence.) One of the stipulations of this code is that it might give different results when the modules are loaded and not. So it's fine to inspect that first and then invoke pkgutil only in the 'loaded' case, with the knowledge that the not-loaded case may be incorrect in the face of certain configurations. >> In that case, I guess it's a good thing; these bugs should be dealt with. Thanks for pointing them out. My opinion of PEP 402 has been completely reversed - although I'd still like to see a section about the module system from a library/tools author point of view rather than a time-traveling perl user's narrative :). > > LOL. > > If you will propose the wording you'd like to see, I'll be happy to check it for any current-and-or-future incorrect assumptions. ;-) If I can come up with anything I will definitely send it along. -glyph -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From guido at python.org Fri Aug 12 23:38:25 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 12 Aug 2011 17:38:25 -0400 Subject: [Python-Dev] GIL removal question In-Reply-To: <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> References: <4E44294F.5060005@molden.no> <20110812174226.0cd068b1@pitrou.net> <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> Message-ID: On Fri, Aug 12, 2011 at 12:57 PM, Rene Nejsum wrote: > I think I understand the background and need for GIL. Without it Python > programs would have been cluttered with lock/synchronized statements and > C-extensions would be harder to write. No, sorry, the first half of this is incorrect: with or without the GIL *Python* code would need the same amount of fine-grained locking. (The part about C extensions is correct.) I am butting in because this is a common misunderstanding that really needs to be squashed whenever it is aired -- the GIL does *not* help Python code to synchronize. A thread-switch can occur between any two bytecode opcodes. Without the GIL, atomic operations (e.g. dict lookups that doesn't require evaluation of __eq__ or __hash__ implemented in Python) are still supposed to be atomic. -- --Guido van Rossum (python.org/~guido) From guido at python.org Fri Aug 12 23:42:39 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 12 Aug 2011 17:42:39 -0400 Subject: [Python-Dev] [PEPs] Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) In-Reply-To: References: <4E43E8D0.40201@netwok.org> Message-ID: On Thu, Aug 11, 2011 at 6:05 PM, Terry Reedy wrote: > There was no comparable transition. Python 2.0 was basically 1.6 renamed for > a different distributor. No that's not true. If you compare the "what's new" sections there is quite a large difference between 1.6 and 2.0, despite being released simultaneously. > I regard Python 2.2, which introduced new-style, as > the beginning of Python 2 as something significantly different from Python > 1. Just compare: http://www.python.org/download/releases/2.0/ http://www.python.org/download/releases/1.6/ No argument that 2.2 was a big jump for the type system -- but not for Unicode. > I suppose one could also point to the earlier intro of unicode. In 1.6. (But internally we called it the "contractual obligation release", a Monty Python reference.) > The new > iterator protocol was also a major change. In any case, back compatibility > was kept in all three respects (and others) until Python 3. (I gotta go, but I don't think it was such a big deal -- it was very carefully made backwards compatible.) -- --Guido van Rossum (python.org/~guido) From pje at telecommunity.com Sat Aug 13 00:42:31 2011 From: pje at telecommunity.com (P.J. Eby) Date: Fri, 12 Aug 2011 18:42:31 -0400 Subject: [Python-Dev] PEP 402: Simplified Package Layout and Partitioning In-Reply-To: <08D6CFE0-3039-4DAF-944F-0F0DEFFFD87A@twistedmatrix.com> References: <4E43E9A6.7020608@netwok.org> <20110811113952.2e257351@resist.wooz.org> <41282FF3-4DAF-4996-B745-E0BEA477FB01@twistedmatrix.com> <20110812152512.112A53A406B@sparrow.telecommunity.com> <0DA48AAD-78EE-496E-BF20-023B7A0868FD@twistedmatrix.com> <20110812183406.7060D3A406B@sparrow.telecommunity.com> <08D6CFE0-3039-4DAF-944F-0F0DEFFFD87A@twistedmatrix.com> Message-ID: <20110812224246.212FA3A406B@sparrow.telecommunity.com> At 05:03 PM 8/12/2011 -0400, Glyph Lefkowitz wrote: >Are there any rules about passing invalid identifiers to __import__ >though, or is that just less likely? :) I suppose you have a point there. 
;-) >I still like the idea of a 'marker' file. It would be great if >there were a new marker like "__package__.py". Having any required marker file makes separately-installable portions of a package impossible, since it would then be in conflict at installation time. The (semi-)competing proposal, PEP 382, is based on allowing each portion to have a differently-named marker; we came up with PEP 402 as a way to get rid of the need for any marker files (not to mention the bikeshedding involved.) >What do you mean "building of a virtual path"? Constructing the __path__-to-be of a not-yet-imported virtual package. The PEP defines a protocol for constructing this, by asking the importer objects to provide __path__ entries, and it does not require anything to be imported. So there's no reason to re-implement the algorithm yourself. >The more that this can focus on module-walking without executing >code, the happier I'll be :). Virtual packages actually improve on this situation, in that a virtual path can be computed without the need to import the package. (Assuming a submodule or subpackage doesn't munge the __path__, of course.) From rene at stranden.com Sat Aug 13 00:51:40 2011 From: rene at stranden.com (Rene Nejsum) Date: Sat, 13 Aug 2011 00:51:40 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <4E44294F.5060005@molden.no> <20110812174226.0cd068b1@pitrou.net> <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> Message-ID: <92060770-873B-4F54-B1FC-DB2840464A30@stranden.com> Thank you for the clarification, I should have been more precise... On 12/08/2011, at 23.38, Guido van Rossum wrote: > On Fri, Aug 12, 2011 at 12:57 PM, Rene Nejsum wrote: >> I think I understand the background and need for GIL. Without it Python >> programs would have been cluttered with lock/synchronized statements and >> C-extensions would be harder to write. > > No, sorry, the first half of this is incorrect: with or without the > GIL *Python* code would need the same amount of fine-grained locking. > (The part about C extensions is correct.) I am butting in because this > is a common misunderstanding that really needs to be squashed whenever > it is aired -- the GIL does *not* help Python code to synchronize. A > thread-switch can occur between any two bytecode opcodes. Without the > GIL, atomic operations (e.g. dict lookups that doesn't require > evaluation of __eq__ or __hash__ implemented in Python) are still > supposed to be atomic. > > -- > --Guido van Rossum (python.org/~guido) From tjreedy at udel.edu Sat Aug 13 03:36:24 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 12 Aug 2011 21:36:24 -0400 Subject: [Python-Dev] [Python-checkins] cpython (3.2): Use real word in English text (i.e. not code) In-Reply-To: <4E456020.2020904@netwok.org> References: <4E455C8A.4030104@udel.edu> <4E456020.2020904@netwok.org> Message-ID: On 8/12/2011 1:17 PM, ?ric Araujo wrote: > With all due respect to the fact that you?re a native speaker and I?m > not, here I disagree because I parse the sentence in this way (using > parens to group things by precedence, if you want): You are right, I misparsed without considering the full context. You actually mean "Command-line option-and-argument-parsing library." But multiple compound-noun adjectives are awkward and the above is ugly. Would "Command-line library for parsing options and arguments" fit? 
-- Terry Jan Reedy From ben+python at benfinney.id.au Sat Aug 13 03:50:41 2011 From: ben+python at benfinney.id.au (Ben Finney) Date: Sat, 13 Aug 2011 11:50:41 +1000 Subject: [Python-Dev] [Python-checkins] cpython (3.2): Use real word in English text (i.e. not code) References: <4E455C8A.4030104@udel.edu> <4E456020.2020904@netwok.org> Message-ID: <87bovuf5am.fsf@benfinney.id.au> Terry Reedy writes: > But multiple compound-noun adjectives are awkward and the above is ugly. > Would "Command-line library for parsing options and arguments" fit? Better, but the binding is still wrong. The ?command-line? should instead be a modifier for ?options and arguments?. So: Library for parsing command-line options and arguments -- \ ?Please to bathe inside the tub.? ?hotel room, Japan | `\ | _o__) | Ben Finney From greg.ewing at canterbury.ac.nz Sat Aug 13 02:31:52 2011 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 13 Aug 2011 12:31:52 +1200 Subject: [Python-Dev] GIL removal question In-Reply-To: <4E45781E.2040608@molden.no> References: <4E44294F.5060005@molden.no> <4E01DE69-C873-48AE-B810-F1E467CBF792@masklinn.net> <4E45781E.2040608@molden.no> Message-ID: <4E45C5F8.6060107@canterbury.ac.nz> Sturla Molden wrote: > With one interpreter per thread, and > a malloc that does not let threads share memory pages (one heap per > thread), Python could do the same. Wouldn't that be more or less equivalent to running each thread in a separate process? -- Greg From stefan_ml at behnel.de Sat Aug 13 08:12:10 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 13 Aug 2011 08:12:10 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <4E44294F.5060005@molden.no> <20110812174226.0cd068b1@pitrou.net> <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> Message-ID: Guido van Rossum, 12.08.2011 23:38: > On Fri, Aug 12, 2011 at 12:57 PM, Rene Nejsum wrote: >> I think I understand the background and need for GIL. Without it Python >> programs would have been cluttered with lock/synchronized statements and >> C-extensions would be harder to write. > > No, sorry, the first half of this is incorrect: with or without the > GIL *Python* code would need the same amount of fine-grained locking. > (The part about C extensions is correct.) I am butting in because this > is a common misunderstanding that really needs to be squashed whenever > it is aired -- the GIL does *not* help Python code to synchronize. A > thread-switch can occur between any two bytecode opcodes. Without the > GIL, atomic operations (e.g. dict lookups that doesn't require > evaluation of __eq__ or __hash__ implemented in Python) are still > supposed to be atomic. And in this context, it's worth mentioning that even C code can be bitten by the GIL being temporarily released when calling back into the interpreter. Only plain C code sequences safely keep the GIL, including many (but not all) calls to the C-API. Stefan From g.brandl at gmx.net Sat Aug 13 08:23:18 2011 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 13 Aug 2011 08:23:18 +0200 Subject: [Python-Dev] =?windows-1252?q?cpython_=283=2E2=29=3A_Fix_find_com?= =?windows-1252?q?mand_in_makefile_=93funny=94_target?= In-Reply-To: References: Message-ID: On 08/12/11 18:03, eric.araujo wrote: > http://hg.python.org/cpython/rev/1b818f3639ef > changeset: 71826:1b818f3639ef > branch: 3.2 > parent: 71823:8032ea4c3619 > user: ?ric Araujo > date: Wed Aug 10 02:01:32 2011 +0200 > summary: > Fix find command in makefile ?funny? 
target > > files: > Makefile.pre.in | 4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > > diff --git a/Makefile.pre.in b/Makefile.pre.in > --- a/Makefile.pre.in > +++ b/Makefile.pre.in > @@ -1283,7 +1283,7 @@ > > # Find files with funny names > funny: > - find $(DISTDIRS) \ > + find $(SUBDIRS) $(SUBDIRSTOO) \ > -name .svn -prune \ > -o -type d \ > -o -name '*.[chs]' \ > @@ -1313,7 +1313,7 @@ > -o -name .hgignore \ > -o -name .bzrignore \ > -o -name MANIFEST \ > - -o -print > + -print This actually broke the command; it only outputs "MANIFEST" now if present. The previous version is correct; please revert this part. Georg From guido at python.org Sat Aug 13 15:08:16 2011 From: guido at python.org (Guido van Rossum) Date: Sat, 13 Aug 2011 09:08:16 -0400 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <4E44294F.5060005@molden.no> <20110812174226.0cd068b1@pitrou.net> <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> Message-ID: On Sat, Aug 13, 2011 at 2:12 AM, Stefan Behnel wrote: > Guido van Rossum, 12.08.2011 23:38: >> >> On Fri, Aug 12, 2011 at 12:57 PM, Rene Nejsum wrote: >>> >>> I think I understand the background and need for GIL. Without it Python >>> programs would have been cluttered with lock/synchronized statements and >>> C-extensions would be harder to write. >> >> No, sorry, the first half of this is incorrect: with or without the >> GIL *Python* code would need the same amount of fine-grained locking. >> (The part about C extensions is correct.) I am butting in because this >> is a common misunderstanding that really needs to be squashed whenever >> it is aired -- the GIL does *not* help Python code to synchronize. A >> thread-switch can occur between any two bytecode opcodes. Without the >> GIL, atomic operations (e.g. dict lookups that doesn't require >> evaluation of __eq__ or __hash__ implemented in Python) are still >> supposed to be atomic. > > And in this context, it's worth mentioning that even C code can be bitten by > the GIL being temporarily released when calling back into the interpreter. > Only plain C code sequences safely keep the GIL, including many (but not > all) calls to the C-API. And, though mostly off-topic, the worst problem with C code, calling back into Python, and the GIL that I have seen (several times): Suppose you are calling some complex C library that creates threads itself, where those threads may also call back into Python. Here you have to put a block around each Python callback that acquires the GIL before and releases it after, since the new threads (created by C code) start without the GIL acquired. I remember a truly nasty incident where the latter was done, but the main thread did not release the GIL since it was returning directly to Python (which would of course release the GIL every so many opcodes so the callbacks would run). But under certain conditions the block with the acquire-release-GIL code around a Python callback was invoked in the main thread (when a validation problem was detected early), and since the main thread didn't release the GIL around the call into the C code, it hung in a nasty spot. Add many layers of software, and a hard-to-reproduce error condition that triggers this, and you have a problem that's very hard to debug... 
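The pattern that avoids that particular hang, sketched very roughly (complex_library_run() is just a hypothetical stand-in for the real entry point), is for the calling thread to release the GIL around the call into the library, so that the acquire-GIL blocks around the callbacks, whichever thread they end up running in, can actually get it:

static PyObject *
call_library(PyObject *self, PyObject *args)
{
    int status;

    /* Drop the GIL while the potentially long-running C call executes,
       so callbacks in threads created by the library can take it. */
    Py_BEGIN_ALLOW_THREADS
    status = complex_library_run();   /* hypothetical library entry point */
    Py_END_ALLOW_THREADS

    if (status != 0) {
        PyErr_SetString(PyExc_RuntimeError, "library call failed");
        return NULL;
    }
    Py_RETURN_NONE;
}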
-- --Guido van Rossum (python.org/~guido) From solipsis at pitrou.net Sat Aug 13 17:43:46 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 13 Aug 2011 17:43:46 +0200 Subject: [Python-Dev] GIL removal question References: <4E44294F.5060005@molden.no> <20110812174226.0cd068b1@pitrou.net> <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> Message-ID: <20110813174346.1a034a0d@pitrou.net> On Sat, 13 Aug 2011 09:08:16 -0400 Guido van Rossum wrote: > > And, though mostly off-topic, the worst problem with C code, calling > back into Python, and the GIL that I have seen (several times): > Suppose you are calling some complex C library that creates threads > itself, where those threads may also call back into Python. Here you > have to put a block around each Python callback that acquires the GIL > before and releases it after, since the new threads (created by C > code) start without the GIL acquired. I remember a truly nasty > incident where the latter was done, but the main thread did not > release the GIL since it was returning directly to Python (which would > of course release the GIL every so many opcodes so the callbacks would > run). But under certain conditions the block with the > acquire-release-GIL code around a Python callback was invoked in the > main thread (when a validation problem was detected early), and since > the main thread didn't release the GIL around the call into the C > code, it hung in a nasty spot. Add many layers of software, and a > hard-to-reproduce error condition that triggers this, and you have a > problem that's very hard to debug... These days we have PyGILState_Ensure(): http://docs.python.org/dev/c-api/init.html#PyGILState_Ensure and even dedicated documentation: http://docs.python.org/dev/c-api/init.html#non-python-created-threads ;) Regards Antoine. From doug.hellmann at gmail.com Sun Aug 14 01:08:40 2011 From: doug.hellmann at gmail.com (Doug Hellmann) Date: Sat, 13 Aug 2011 19:08:40 -0400 Subject: [Python-Dev] Fwd: Mirroring Python repos to Bitbucket References: <4E42DF4A.8010407@atlassian.com> Message-ID: Charles McLaughlin of Atlassian has set up mirrors of the Mercurial repositories hosted on python.org as part of the ongoing infrastructure improvement work. These mirrors will give us a public fail-over repository in the event that hg.python.org goes offline unexpectedly, and also provide features such as RSS feeds of changes for users interested in monitoring the repository passively. Thank you, Charles for setting this up and Atlassian for hosting it! Doug Begin forwarded message: > From: Charles McLaughlin > Date: August 10, 2011 3:43:06 PM EDT > To: Jesse Noller , Doug Hellmann > Subject: Mirroring Python repos to Bitbucket > > Hey, > > You guys expressed some interest in mirroring repos to Bitbucket a > couple weeks ago. I mentioned we mirror a few Python repos here: > > https://bitbucket.org/mirror/ > > But that doesn't cover everything from hg.python.org. So, I wrote a > little script that scrapes the Python HgWeb and mirrors everything to > Bitbucket. Here's the script in case you're curious: > > https://bitbucket.org/cmclaughlin/mirror-python-repos/ > > We're running the script hourly to keep the mirrors up to date. The > mirrored repos live under this URL: > > https://bitbucket.org/python_mirrors > > A few people here have mentioned "python_mirrors" is a strange name. I > can change that if you'd like. I don't have any better ideas though. 
> Also, anyone can fork my script if they see any room for improvement :) > > Please feel free to forward this email to mailing lists, etc to get the > word out. > > Regards, > Charles From solipsis at pitrou.net Sun Aug 14 01:23:01 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 14 Aug 2011 01:23:01 +0200 Subject: [Python-Dev] Mirroring Python repos to Bitbucket References: <4E42DF4A.8010407@atlassian.com> Message-ID: <20110814012301.45c46c1e@pitrou.net> On Sat, 13 Aug 2011 19:08:40 -0400 Doug Hellmann wrote: > > Charles McLaughlin of Atlassian has set up mirrors of the Mercurial repositories hosted on python.org as part of the ongoing infrastructure improvement work. These mirrors will give us a public fail-over repository in the event that hg.python.org goes offline unexpectedly, and also provide features such as RSS feeds of changes for users interested in monitoring the repository passively. There is already an RSS feed at http://hg.python.org/cpython/rss-log Another possibility is the gmane mirror of python-checkins, which has its own RSS feed: http://rss.gmane.org/gmane.comp.python.cvs Regards Antoine. From guido at python.org Sun Aug 14 01:26:47 2011 From: guido at python.org (Guido van Rossum) Date: Sat, 13 Aug 2011 16:26:47 -0700 Subject: [Python-Dev] GIL removal question In-Reply-To: <20110813174346.1a034a0d@pitrou.net> References: <4E44294F.5060005@molden.no> <20110812174226.0cd068b1@pitrou.net> <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> <20110813174346.1a034a0d@pitrou.net> Message-ID: On Sat, Aug 13, 2011 at 8:43 AM, Antoine Pitrou wrote: > On Sat, 13 Aug 2011 09:08:16 -0400 > Guido van Rossum wrote: >> >> And, though mostly off-topic, the worst problem with C code, calling >> back into Python, and the GIL that I have seen (several times): >> Suppose you are calling some complex C library that creates threads >> itself, where those threads may also call back into Python. Here you >> have to put a block around each Python callback that acquires the GIL >> before and releases it after, since the new threads (created by C >> code) start without the GIL acquired. I remember a truly nasty >> incident where the latter was done, but the main thread did not >> release the GIL since it was returning directly to Python (which would >> of course release the GIL every so many opcodes so the callbacks would >> run). But under certain conditions the block with the >> acquire-release-GIL code around a Python callback was invoked in the >> main thread (when a validation problem was detected early), and since >> the main thread didn't release the GIL around the call into the C >> code, it hung in a nasty spot. Add many layers of software, and a >> hard-to-reproduce error condition that triggers this, and you have a >> problem that's very hard to debug... > > These days we have PyGILState_Ensure(): > http://docs.python.org/dev/c-api/init.html#PyGILState_Ensure > > and even dedicated documentation: > http://docs.python.org/dev/c-api/init.html#non-python-created-threads > > ;) That is awesome! 
-- --Guido van Rossum (python.org/~guido) From guido at python.org Sun Aug 14 02:08:20 2011 From: guido at python.org (Guido van Rossum) Date: Sat, 13 Aug 2011 17:08:20 -0700 Subject: [Python-Dev] PEP 3154 - pickle protocol 4 In-Reply-To: <20110812125846.00a75cd1@pitrou.net> References: <20110812125846.00a75cd1@pitrou.net> Message-ID: On Fri, Aug 12, 2011 at 3:58 AM, Antoine Pitrou wrote: > This PEP is an attempt to foster a number of small incremental > improvements in a future pickle protocol version. The PEP process is > used in order to gather as many improvements as possible, because the > introduction of a new protocol version should be a rare occurrence. Thanks. this sounds like a good idea. That's not to say that I have already approved the PEP. :-) But from skimming it I have no objections except that it needs to be fleshed out. -- --Guido van Rossum (python.org/~guido) From doug.hellmann at gmail.com Sun Aug 14 02:42:46 2011 From: doug.hellmann at gmail.com (Doug Hellmann) Date: Sat, 13 Aug 2011 20:42:46 -0400 Subject: [Python-Dev] Mirroring Python repos to Bitbucket In-Reply-To: <20110814012301.45c46c1e@pitrou.net> References: <4E42DF4A.8010407@atlassian.com> <20110814012301.45c46c1e@pitrou.net> Message-ID: <4FD4019A-F3EF-47AD-8C8E-6D9A9D8BF8A8@gmail.com> On Aug 13, 2011, at 7:23 PM, Antoine Pitrou wrote: > On Sat, 13 Aug 2011 19:08:40 -0400 > Doug Hellmann wrote: >> >> Charles McLaughlin of Atlassian has set up mirrors of the Mercurial repositories hosted on python.org as part of the ongoing infrastructure improvement work. These mirrors will give us a public fail-over repository in the event that hg.python.org goes offline unexpectedly, and also provide features such as RSS feeds of changes for users interested in monitoring the repository passively. > > There is already an RSS feed at http://hg.python.org/cpython/rss-log > Another possibility is the gmane mirror of python-checkins, which has > its own RSS feed: http://rss.gmane.org/gmane.comp.python.cvs Thanks for the tip, I didn't know about either of those. Doug From ncoghlan at gmail.com Sun Aug 14 03:37:18 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 14 Aug 2011 11:37:18 +1000 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <4E44294F.5060005@molden.no> <20110812174226.0cd068b1@pitrou.net> <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> <20110813174346.1a034a0d@pitrou.net> Message-ID: On Sun, Aug 14, 2011 at 9:26 AM, Guido van Rossum wrote: >> These days we have PyGILState_Ensure(): >> http://docs.python.org/dev/c-api/init.html#PyGILState_Ensure >> >> and even dedicated documentation: >> http://docs.python.org/dev/c-api/init.html#non-python-created-threads >> >> ;) > > That is awesome! Although, if it's possible to arrange it, it's still better to do that once and then use BEGIN/END_ALLOW_THREADS to avoid the overhead of creating and destroying the temporary thread states: http://blog.ccpgames.com/kristjan/2011/06/23/temporary-thread-state-overhead/ Still, it's far, far easier than it used to be to handle the GIL correctly from non-Python created threads. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? 
Brisbane, Australia From ncoghlan at gmail.com Sun Aug 14 03:42:15 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 14 Aug 2011 11:42:15 +1000 Subject: [Python-Dev] Fwd: Mirroring Python repos to Bitbucket In-Reply-To: References: <4E42DF4A.8010407@atlassian.com> Message-ID: On Sun, Aug 14, 2011 at 9:08 AM, Doug Hellmann wrote: > > Charles McLaughlin of Atlassian has set up mirrors of the Mercurial repositories hosted on python.org as part of the ongoing infrastructure improvement work. These mirrors will give us a public fail-over repository in the event that hg.python.org goes offline unexpectedly, and also provide features such as RSS feeds of changes for users interested in monitoring the repository passively. The main advantage of those mirrors to my mind is that it makes it easy for anyone to clone their own copy of the python.org repos without having to upload the whole thing to bitbucket themselves. That makes it easy for people to use a natural Mercurial workflow to develop and collaborate on patches, even for components other than the main CPython repo (e.g. the devguide or the benchmark suite). Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From ncoghlan at gmail.com Sun Aug 14 03:44:53 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 14 Aug 2011 11:44:53 +1000 Subject: [Python-Dev] [Python-checkins] cpython: Monotonic, not monotonous In-Reply-To: References: Message-ID: On Sun, Aug 14, 2011 at 9:53 AM, antoine.pitrou wrote: > http://hg.python.org/cpython/rev/0273d0734593 > changeset: ? 71862:0273d0734593 > user: ? ? ? ?Antoine Pitrou > date: ? ? ? ?Sun Aug 14 01:51:52 2011 +0200 > summary: > ?Monotonic, not monotonous > > files: > ?Lib/test/pickletester.py | ?2 +- > ?1 files changed, 1 insertions(+), 1 deletions(-) I dunno, I reckon systematically testing pickles could get pretty monotonous, too ;) Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From sturla at molden.no Sun Aug 14 03:53:15 2011 From: sturla at molden.no (Sturla Molden) Date: Sun, 14 Aug 2011 03:53:15 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: <20110813174346.1a034a0d@pitrou.net> References: <4E44294F.5060005@molden.no> <20110812174226.0cd068b1@pitrou.net> <3F137782-3643-4077-92F7-519C55B921CC@stranden.com> <20110813174346.1a034a0d@pitrou.net> Message-ID: <4E472A8B.9040208@molden.no> Den 13.08.2011 17:43, skrev Antoine Pitrou: > These days we have PyGILState_Ensure(): > http://docs.python.org/dev/c-api/init.html#PyGILState_Ensure > With the most recent Cython (0.15) we can just do: with gil: to ensure holding the GIL. And similarly from a thread holding the GIL with nogil: to temporarily release it. There are also some OpenMP support in Cython 0.15. OpenMP is much easier than messing around with threads manually (it moves all the hard parts of multithreading to the compiler). Now Cython almost makes it look Pythonic: http://docs.cython.org/src/userguide/parallelism.html Sturla From ziade.tarek at gmail.com Sun Aug 14 11:41:47 2011 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Sun, 14 Aug 2011 11:41:47 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? Message-ID: Hi all, I am lacking of time right now to finish an important task before 3.2 final is out: we need to release "packaging" as a standalone release under Python 2.x and Python 3.1, to gather as much feedback as we can from more people. 
Doing an automated conversion turned out to be a nightmare, and I was about to go ahead and maintain a fork of the packaging package, with the few modules that are needed (sysconfig, etc) within a standalone release. I am looking for someone that has some free time and that is willing to lead this work. 3.2 can go out without this work of course, but it would be *much* better to have that feedback If you are interested, please let me know. Cheers Tarek -- Tarek Ziad? | http://ziade.org From martin at v.loewis.de Sun Aug 14 18:31:50 2011 From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 14 Aug 2011 18:31:50 +0200 Subject: [Python-Dev] Python 3.2.2rc1 Message-ID: <4E47F876.7010105@v.loewis.de> On behalf of the Python development team, I'm happy to announce the first release candidate of the Python 3.2.2 maintenance release (3.2.2rc1). Python 3.2.2 fixes `a regression `_ in the ``urllib.request`` module that prevented opening many HTTP resources correctly with Python 3.2.1. Python 3.2 is a continuation of the efforts to improve and stabilize the Python 3.x line. Since the final release of Python 2.7, the 2.x line will only receive bugfixes, and new features are developed for 3.x only. Since PEP 3003, the Moratorium on Language Changes, is in effect, there are no changes in Python's syntax and built-in types in Python 3.2. Development efforts concentrated on the standard library and support for porting code to Python 3. Highlights are: * numerous improvements to the unittest module * PEP 3147, support for .pyc repository directories * PEP 3149, support for version tagged dynamic libraries * PEP 3148, a new futures library for concurrent programming * PEP 384, a stable ABI for extension modules * PEP 391, dictionary-based logging configuration * an overhauled GIL implementation that reduces contention * an extended email package that handles bytes messages * a much improved ssl module with support for SSL contexts and certificate hostname matching * a sysconfig module to access configuration information * additions to the shutil module, among them archive file support * many enhancements to configparser, among them mapping protocol support * improvements to pdb, the Python debugger * countless fixes regarding bytes/string issues; among them full support for a bytes environment (filenames, environment variables) * many consistency and behavior fixes for numeric operations For a more extensive list of changes in 3.2, see http://docs.python.org/3.2/whatsnew/3.2.html To download Python 3.2 visit: http://www.python.org/download/releases/3.2/ Please consider trying Python 3.2 with your code and reporting any bugs you may notice to: http://bugs.python.org/ Enjoy! -- Martin v. L?wis (on behalf of the entire python-dev team and 3.2's contributors) From ncoghlan at gmail.com Mon Aug 15 01:20:44 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 15 Aug 2011 09:20:44 +1000 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: References: Message-ID: On Sun, Aug 14, 2011 at 7:41 PM, Tarek Ziad? wrote: > Hi all, > > I am lacking of time right now to finish an important task before 3.2 > final is out: If anyone else got at all confused by Tarek's email, s/3.x/3.x+1/ and it will all make sense (the mentioned release numbers in the 3.x series are all one lower than they should be - packaging is planned for 3.3, but a standalone library will allow feedback to be gathered from 2.x and 3.2 users before the API is 'locked in' for 3.3). 
How far has packaging diverged from distutils2, though? Wasn't that the planned venue for any backports in order to avoid name conflicts? Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From tjreedy at udel.edu Mon Aug 15 03:06:29 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Sun, 14 Aug 2011 21:06:29 -0400 Subject: [Python-Dev] Python 3.2.2rc1 In-Reply-To: <4E47F876.7010105@v.loewis.de> References: <4E47F876.7010105@v.loewis.de> Message-ID: On 8/14/2011 12:31 PM, "Martin v. L?wis" wrote: > On behalf of the Python development team, I'm happy to announce the > first release candidate of the Python 3.2.2 maintenance release (3.2.2rc1). > > Python 3.2.2 fixes `a regression`_ in the > ``urllib.request`` module that prevented opening many HTTP resources > correctly > with Python 3.2.1. It also has the fix for http://bugs.python.org/issue12540 as requested. Thank you. -- Terry Jan Reedy From brett at python.org Mon Aug 15 04:34:38 2011 From: brett at python.org (Brett Cannon) Date: Sun, 14 Aug 2011 19:34:38 -0700 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: <20110811090242.1083782f@msiwind> References: <20110811090242.1083782f@msiwind> Message-ID: On Thu, Aug 11, 2011 at 00:02, Antoine Pitrou wrote: > Le Thu, 11 Aug 2011 03:34:37 +0200, > brian.curtin a ?crit : > > http://hg.python.org/cpython/rev/77a65b078852 > > changeset: 71809:77a65b078852 > > parent: 71803:1b4fae183da3 > > user: Brian Curtin > > date: Wed Aug 10 20:05:21 2011 -0500 > > summary: > > Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. > > > It would sound more useful to have a generic Py_RETURN() macro rather than > some specific forms for each and every common object. > Since the macro is rather generic, sure, but the name should probably be better since it doesn't necessarily convene the fact that a INCREF has occurred. So maybe Py_INCREF_RETURN()? > > Regards > > Antoine. > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/brett%40python.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Mon Aug 15 04:36:54 2011 From: benjamin at python.org (Benjamin Peterson) Date: Sun, 14 Aug 2011 21:36:54 -0500 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: References: <20110811090242.1083782f@msiwind> Message-ID: 2011/8/14 Brett Cannon : > > > On Thu, Aug 11, 2011 at 00:02, Antoine Pitrou wrote: >> >> Le Thu, 11 Aug 2011 03:34:37 +0200, >> brian.curtin a ?crit : >> > http://hg.python.org/cpython/rev/77a65b078852 >> > changeset: ? 71809:77a65b078852 >> > parent: ? ? ?71803:1b4fae183da3 >> > user: ? ? ? ?Brian Curtin >> > date: ? ? ? ?Wed Aug 10 20:05:21 2011 -0500 >> > summary: >> > ? Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. >> >> >> It would sound more useful to have a generic Py_RETURN() macro rather than >> some specific forms for each and every common object. > > Since the macro is rather generic, sure, but the name should probably be > better since it doesn't necessarily convene the fact that a INCREF has > occurred. So maybe Py_INCREF_RETURN()? That nearly nullifies the space saving. I think that fact that it's a macro at all conveys that it does something else aside from "return x;". 
-- Regards, Benjamin From brett at python.org Mon Aug 15 04:39:44 2011 From: brett at python.org (Brett Cannon) Date: Sun, 14 Aug 2011 19:39:44 -0700 Subject: [Python-Dev] Backporting howto/pyporting to 2.7 In-Reply-To: <4E4559F8.7040507@netwok.org> References: <4E4559F8.7040507@netwok.org> Message-ID: Probably mostly hassle of maintaining changes between the two versions, but the doc will probably get more exposure that way. I say if you want to spearhead the backport, go for it. On Fri, Aug 12, 2011 at 09:51, ?ric Araujo wrote: > Hi everyone, > > I think it would be useful to have the ?Porting Python 2 Code to Python > 3? HOWTO in the 2.7 docs, as I think that a lot of users consult the 2.7 > docs. Is there any reason not to do it? > > Regards > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/brett%40python.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From brett at python.org Mon Aug 15 04:45:49 2011 From: brett at python.org (Brett Cannon) Date: Sun, 14 Aug 2011 19:45:49 -0700 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: References: <20110811090242.1083782f@msiwind> Message-ID: On Sun, Aug 14, 2011 at 19:36, Benjamin Peterson wrote: > 2011/8/14 Brett Cannon : > > > > > > On Thu, Aug 11, 2011 at 00:02, Antoine Pitrou > wrote: > >> > >> Le Thu, 11 Aug 2011 03:34:37 +0200, > >> brian.curtin a ?crit : > >> > http://hg.python.org/cpython/rev/77a65b078852 > >> > changeset: 71809:77a65b078852 > >> > parent: 71803:1b4fae183da3 > >> > user: Brian Curtin > >> > date: Wed Aug 10 20:05:21 2011 -0500 > >> > summary: > >> > Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. > >> > >> > >> It would sound more useful to have a generic Py_RETURN() macro rather > than > >> some specific forms for each and every common object. > > > > Since the macro is rather generic, sure, but the name should probably be > > better since it doesn't necessarily convene the fact that a INCREF has > > occurred. So maybe Py_INCREF_RETURN()? > > That nearly nullifies the space saving. I think that fact that it's a > macro at all conveys that it does something else aside from "return > x;". > This is C code; space savings went out the window along with gc a long time ago. Yes, being a macro helps differentiate semantics that a longer name is probably not needed. > > > -- > Regards, > Benjamin > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Mon Aug 15 04:48:45 2011 From: benjamin at python.org (Benjamin Peterson) Date: Sun, 14 Aug 2011 21:48:45 -0500 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: References: <20110811090242.1083782f@msiwind> Message-ID: 2011/8/14 Brett Cannon : > > > On Sun, Aug 14, 2011 at 19:36, Benjamin Peterson > wrote: >> >> 2011/8/14 Brett Cannon : >> > >> > >> > On Thu, Aug 11, 2011 at 00:02, Antoine Pitrou >> > wrote: >> >> >> >> Le Thu, 11 Aug 2011 03:34:37 +0200, >> >> brian.curtin a ?crit : >> >> > http://hg.python.org/cpython/rev/77a65b078852 >> >> > changeset: ? 71809:77a65b078852 >> >> > parent: ? ? ?71803:1b4fae183da3 >> >> > user: ? ? ? ?Brian Curtin >> >> > date: ? ? ? ?Wed Aug 10 20:05:21 2011 -0500 >> >> > summary: >> >> > ? Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. 
>> >> >> >> >> >> It would sound more useful to have a generic Py_RETURN() macro rather >> >> than >> >> some specific forms for each and every common object. >> > >> > Since the macro is rather generic, sure, but the name should probably be >> > better since it doesn't necessarily convene the fact that a INCREF has >> > occurred. So maybe Py_INCREF_RETURN()? >> >> That nearly nullifies the space saving. I think that fact that it's a >> macro at all conveys that it does something else aside from "return >> x;". > > This is C code; space savings went out the window along with gc a long time > ago. I'm hanging on to it by a hair. :) -- Regards, Benjamin From ncoghlan at gmail.com Mon Aug 15 05:16:44 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 15 Aug 2011 13:16:44 +1000 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: References: <20110811090242.1083782f@msiwind> Message-ID: On Mon, Aug 15, 2011 at 12:34 PM, Brett Cannon wrote: > On Thu, Aug 11, 2011 at 00:02, Antoine Pitrou wrote: >> It would sound more useful to have a generic Py_RETURN() macro rather than >> some specific forms for each and every common object. > > Since the macro is rather generic, sure, but the name should probably be > better since it doesn't necessarily convene the fact that a INCREF has > occurred. So maybe Py_INCREF_RETURN()? Aside from None and NotImplemented, do we really do the straight incref-and-return all that often? While I was initially attracted to the idea of a generic macro, the more I thought about it, the more it seemed like a magnet for reference leak bugs. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From brett at python.org Mon Aug 15 06:46:22 2011 From: brett at python.org (Brett Cannon) Date: Sun, 14 Aug 2011 21:46:22 -0700 Subject: [Python-Dev] Latest draft of PEP 399 (Pure Python/C Accelerator Module Compatibility Requirements) In-Reply-To: References: Message-ID: Since the latest draft went four weeks w/o comment or complaint to address the last round of issues, I am going to consider this PEP accepted (don't think we need a BDFAP since this is procedural and not a language feature or stdlib addition, but if people disagree then Guido can assign someone). On Sun, Jul 17, 2011 at 17:16, Brett Cannon wrote: > While at a mini-PyPy sprint w/ Alex Gaynor of PyPy and Phil Jenvey of > Jython, I decided to finally put the time in to update this PEP yet again. > > The biggest changes is that the 100% branch coverage requirement has been > replaced with "comprehensive" coverage from the tests. I think we are all > enough grown-ups to not have to specify anything tighter than this. I also > added a paragraph in the Details section about using the abstract C APIs > (e.g., PyObject_GetItem) over type-specific ones (e.g., PyList_GetItem) in > order to be more supportive of duck typing from the Python code. I figure > the "be API compatible" assumes this, but mentioning it doesn't hurt (and > should help make Raymond less angry =). 
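As a concrete illustration of the duck-typing point above, here is a sketch of the shape the PEP asks accelerator code to take. The function name is made up for the example; only the API calls are real.

#include <Python.h>

/* Prefer the abstract protocol (PyObject_GetItem) so that any object which
 * duck-types as a sequence is accepted; the concrete-type call is allowed
 * only as an explicit fast path behind an exact-type check. */
static PyObject *
example_first_item(PyObject *seq)
{
    PyObject *index, *item;

    if (PyList_CheckExact(seq)) {
        if (PyList_GET_SIZE(seq) == 0) {
            PyErr_SetString(PyExc_IndexError, "example_first_item(): empty list");
            return NULL;
        }
        item = PyList_GET_ITEM(seq, 0);     /* borrowed reference */
        Py_INCREF(item);
        return item;
    }

    index = PyLong_FromLong(0);
    if (index == NULL)
        return NULL;
    /* New reference; works for any object implementing __getitem__. */
    item = PyObject_GetItem(seq, index);
    Py_DECREF(index);
    return item;
}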
> > > PEP: 399 > Title: Pure Python/C Accelerator Module Compatibility Requirements > Version: $Revision: 88219 $ > Last-Modified: $Date: 2011-01-27 13:47:00 -0800 (Thu, 27 Jan 2011) $ > Author: Brett Cannon > Status: Draft > Type: Informational > Content-Type: text/x-rst > Created: 04-Apr-2011 > Python-Version: 3.3 > Post-History: 04-Apr-2011, 12-Apr-2011, 17-Jul-2011 > > Abstract > ======== > > The Python standard library under CPython contains various instances > of modules implemented in both pure Python and C (either entirely or > partially). This PEP requires that in these instances that the > C code *must* pass the test suite used for the pure Python code > so as to act as much as a drop-in replacement as reasonably possible > (C- and VM-specific tests are exempt). It is also required that new > C-based modules lacking a pure Python equivalent implementation get > special permission to be added to the standard library. > > > Rationale > ========= > > Python has grown beyond the CPython virtual machine (VM). IronPython_, > Jython_, and PyPy_ are all currently viable alternatives to the > CPython VM. The VM ecosystem that has sprung up around the Python > programming language has led to Python being used in many different > areas where CPython cannot be used, e.g., Jython allowing Python to be > used in Java applications. > > A problem all of the VMs other than CPython face is handling modules > from the standard library that are implemented (to some extent) in C. > Since other VMs do not typically support the entire `C API of CPython`_ > they are unable to use the code used to create the module. Often times > this leads these other VMs to either re-implement the modules in pure > Python or in the programming language used to implement the VM itself > (e.g., in C# for IronPython). This duplication of effort between > CPython, PyPy, Jython, and IronPython is extremely unfortunate as > implementing a module *at least* in pure Python would help mitigate > this duplicate effort. > > The purpose of this PEP is to minimize this duplicate effort by > mandating that all new modules added to Python's standard library > *must* have a pure Python implementation _unless_ special dispensation > is given. This makes sure that a module in the stdlib is available to > all VMs and not just to CPython (pre-existing modules that do not meet > this requirement are exempt, although there is nothing preventing > someone from adding in a pure Python implementation retroactively). > > Re-implementing parts (or all) of a module in C (in the case > of CPython) is still allowed for performance reasons, but any such > accelerated code must pass the same test suite (sans VM- or C-specific > tests) to verify semantics and prevent divergence. To accomplish this, > the test suite for the module must have comprehensive coverage of the > pure Python implementation before the acceleration code may be added. > > > Details > ======= > > Starting in Python 3.3, any modules added to the standard library must > have a pure Python implementation. This rule can only be ignored if > the Python development team grants a special exemption for the module. > Typically the exemption will be granted only when a module wraps a > specific C-based library (e.g., sqlite3_). In granting an exemption it > will be recognized that the module will be considered exclusive to > CPython and not part of Python's standard library that other VMs are > expected to support. 
Usage of ``ctypes`` to provide an > API for a C library will continue to be frowned upon as ``ctypes`` > lacks compiler guarantees that C code typically relies upon to prevent > certain errors from occurring (e.g., API changes). > > Even though a pure Python implementation is mandated by this PEP, it > does not preclude the use of a companion acceleration module. If an > acceleration module is provided it is to be named the same as the > module it is accelerating with an underscore attached as a prefix, > e.g., ``_warnings`` for ``warnings``. The common pattern to access > the accelerated code from the pure Python implementation is to import > it with an ``import *``, e.g., ``from _warnings import *``. This is > typically done at the end of the module to allow it to overwrite > specific Python objects with their accelerated equivalents. This kind > of import can also be done before the end of the module when needed, > e.g., an accelerated base class is provided but is then subclassed by > Python code. This PEP does not mandate that pre-existing modules in > the stdlib that lack a pure Python equivalent gain such a module. But > if people do volunteer to provide and maintain a pure Python > equivalent (e.g., the PyPy team volunteering their pure Python > implementation of the ``csv`` module and maintaining it) then such > code will be accepted. In those instances the C version is considered > the reference implementation in terms of expected semantics. > > Any new accelerated code must act as a drop-in replacement as close > to the pure Python implementation as reasonable. Technical details of > the VM providing the accelerated code are allowed to differ as > necessary, e.g., a class being a ``type`` when implemented in C. To > verify that the Python and equivalent C code operate as similarly as > possible, both code bases must be tested using the same tests which > apply to the pure Python code (tests specific to the C code or any VM > do not follow under this requirement). The test suite is expected to > be extensive in order to verify expected semantics. > > Acting as a drop-in replacement also dictates that no public API be > provided in accelerated code that does not exist in the pure Python > code. Without this requirement people could accidentally come to rely > on a detail in the accelerated code which is not made available to > other VMs that use the pure Python implementation. To help verify > that the contract of semantic equivalence is being met, a module must > be tested both with and without its accelerated code as thoroughly as > possible. > > As an example, to write tests which exercise both the pure Python and > C accelerated versions of a module, a basic idiom can be followed:: > > import collections.abc > from test.support import import_fresh_module, run_unittest > import unittest > > c_heapq = import_fresh_module('heapq', fresh=['_heapq']) > py_heapq = import_fresh_module('heapq', blocked=['_heapq']) > > > class ExampleTest(unittest.TestCase): > > def test_heappop_exc_for_non_MutableSequence(self): > # Raise TypeError when heap is not a > # collections.abc.MutableSequence. 
> class Spam: > """Test class lacking many ABC-required methods > (e.g., pop()).""" > def __len__(self): > return 0 > > heap = Spam() > self.assertFalse(isinstance(heap, > collections.abc.MutableSequence)) > with self.assertRaises(TypeError): > self.heapq.heappop(heap) > > > class AcceleratedExampleTest(ExampleTest): > > """Test using the accelerated code.""" > > heapq = c_heapq > > > class PyExampleTest(ExampleTest): > > """Test with just the pure Python code.""" > > heapq = py_heapq > > > def test_main(): > run_unittest(AcceleratedExampleTest, PyExampleTest) > > > if __name__ == '__main__': > test_main() > > > If this test were to provide extensive coverage for > ``heapq.heappop()`` in the pure Python implementation then the > accelerated C code would be allowed to be added to CPython's standard > library. If it did not, then the test suite would need to be updated > until proper coverage was provided before the accelerated C code > could be added. > > To also help with compatibility, C code should use abstract APIs on > objects to prevent accidental dependence on specific types. For > instance, if a function accepts a sequence then the C code should > default to using `PyObject_GetItem()` instead of something like > `PyList_GetItem()`. C code is allowed to have a fast path if the > proper `PyList_Check()` is used, but otherwise APIs should work with > any object that duck types to the proper interface instead of a > specific type. > > > Copyright > ========= > > This document has been placed in the public domain. > > > .. _IronPython: http://ironpython.net/ > .. _Jython: http://www.jython.org/ > .. _PyPy: http://pypy.org/ > .. _C API of CPython: http://docs.python.org/py3k/c-api/index.html > .. _sqlite3: http://docs.python.org/py3k/library/sqlite3.html > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Mon Aug 15 06:50:19 2011 From: benjamin at python.org (Benjamin Peterson) Date: Sun, 14 Aug 2011 23:50:19 -0500 Subject: [Python-Dev] Latest draft of PEP 399 (Pure Python/C Accelerator Module Compatibility Requirements) In-Reply-To: References: Message-ID: 2011/8/14 Brett Cannon : >> proper `PyList_Check()` is used, but otherwise APIs should work with To be terribly nitty, what is probably wanted to is PyList_CheckExact. -- Regards, Benjamin From ziade.tarek at gmail.com Mon Aug 15 12:31:02 2011 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Mon, 15 Aug 2011 12:31:02 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: References: Message-ID: On Mon, Aug 15, 2011 at 1:20 AM, Nick Coghlan wrote: > On Sun, Aug 14, 2011 at 7:41 PM, Tarek Ziad? wrote: >> Hi all, >> >> I am lacking of time right now to finish an important task before 3.2 >> final is out: > > If anyone else got at all confused by Tarek's email, s/3.x/3.x+1/ and > it will all make sense (the mentioned release numbers in the 3.x > series are all one lower than they should be - packaging is planned > for 3.3, but a standalone library will allow feedback to be gathered > from 2.x and 3.2 users before the API is 'locked in' for 3.3). Oh yeah sorry for the version mess up :) > How far has packaging diverged from distutils2, though? Wasn't that > the planned venue for any backports in order to avoid name conflicts? The plan is to provide under earlier versions of Python a standalone project that does not use the "packaging" namespace, but the "distutils2" namespace. 
IOW, the task to do is: 1/ copy packaging and all its stdlib dependencies in a standalone project 2/ rename packaging to distutils2 3/ make it work under older 2.x and 3.x (2.x would be the priority) <==== 4/ release it, promote its usage 5/ consolidate the API with the feedback received I realize it's by far the less interesting task to do in packaging, but it's by far one of the most important Cheers Tarek -- Tarek Ziad? | http://ziade.org From solipsis at pitrou.net Mon Aug 15 14:17:23 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 15 Aug 2011 14:17:23 +0200 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: References: <20110811090242.1083782f@msiwind> Message-ID: <1313410643.3557.2.camel@localhost.localdomain> Le lundi 15 ao?t 2011 ? 13:16 +1000, Nick Coghlan a ?crit : > On Mon, Aug 15, 2011 at 12:34 PM, Brett Cannon wrote: > > On Thu, Aug 11, 2011 at 00:02, Antoine Pitrou wrote: > >> It would sound more useful to have a generic Py_RETURN() macro rather than > >> some specific forms for each and every common object. > > > > Since the macro is rather generic, sure, but the name should probably be > > better since it doesn't necessarily convene the fact that a INCREF has > > occurred. So maybe Py_INCREF_RETURN()? > > Aside from None and NotImplemented, do we really do the straight > incref-and-return all that often? AFAICT, often with True and False: x = (some condition) ? Py_True : Py_False; Py_INCREF(x); return x; Regards Antoine. From ncoghlan at gmail.com Mon Aug 15 14:35:08 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 15 Aug 2011 22:35:08 +1000 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: <1313410643.3557.2.camel@localhost.localdomain> References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> Message-ID: On Mon, Aug 15, 2011 at 10:17 PM, Antoine Pitrou wrote: > AFAICT, often with True and False: > > ? ?x = (some condition) ? Py_True : Py_False; > ? ?Py_INCREF(x); > ? ?return x; And that's an idiom that works better with a Py_RETURN macro than it would separate macros: Py_RETURN(cond ? Py_True : Py_False); OK, I'm persuaded that "Py_RETURN(Py_NotImplemented);" would be a better way to handle this change: +1 Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From raymond.hettinger at gmail.com Mon Aug 15 14:46:12 2011 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Mon, 15 Aug 2011 05:46:12 -0700 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> Message-ID: <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> On Aug 15, 2011, at 5:35 AM, Nick Coghlan wrote: > On Mon, Aug 15, 2011 at 10:17 PM, Antoine Pitrou wrote: >> AFAICT, often with True and False: >> >> x = (some condition) ? Py_True : Py_False; >> Py_INCREF(x); >> return x; > > And that's an idiom that works better with a Py_RETURN macro than it > would separate macros: > > Py_RETURN(cond ? Py_True : Py_False); > > OK, I'm persuaded that "Py_RETURN(Py_NotImplemented);" would be a > better way to handle this change: +1 I don't think that is worth it. There is some value to keeping the API consistent with the style that has been used in the past. So, I vote for Py_RETURN_NOTIMPLEMENTED. There's no real need to factor this any further. 
It's not hard and not important enough to introduce a new variation on return macros. Adding another return style makes the C API harder to learn and remember. If we were starting from scratch, Py_RETURN(obj) would make sense. But we're not starting from scratch, so we should stick with the precedents. Raymond From solipsis at pitrou.net Mon Aug 15 14:48:07 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 15 Aug 2011 14:48:07 +0200 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> Message-ID: <20110815144807.2c607721@pitrou.net> On Mon, 15 Aug 2011 05:46:12 -0700 Raymond Hettinger wrote: > > I don't think that is worth it. > There is some value to keeping the API consistent with the style that has been used in the past. > So, I vote for Py_RETURN_NOTIMPLEMENTED. There's no real need to factor this any further. > It's not hard and not important enough to introduce a new variation on return macros. Why is Py_RETURN_NOTIMPLEMENTED important at all? Regards Antoine. From stefan_ml at behnel.de Mon Aug 15 15:21:43 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 15 Aug 2011 15:21:43 +0200 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> Message-ID: Nick Coghlan, 15.08.2011 14:35: > On Mon, Aug 15, 2011 at 10:17 PM, Antoine Pitrou wrote: >> AFAICT, often with True and False: >> >> x = (some condition) ? Py_True : Py_False; >> Py_INCREF(x); >> return x; > > And that's an idiom that works better with a Py_RETURN macro than it > would separate macros: > > Py_RETURN(cond ? Py_True : Py_False); And that would do what exactly? Duplicate the evaluation of the condition? Stefan From solipsis at pitrou.net Mon Aug 15 15:29:48 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 15 Aug 2011 15:29:48 +0200 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> Message-ID: <20110815152948.67209fcf@pitrou.net> On Mon, 15 Aug 2011 15:21:43 +0200 Stefan Behnel wrote: > Nick Coghlan, 15.08.2011 14:35: > > On Mon, Aug 15, 2011 at 10:17 PM, Antoine Pitrou wrote: > >> AFAICT, often with True and False: > >> > >> x = (some condition) ? Py_True : Py_False; > >> Py_INCREF(x); > >> return x; > > > > And that's an idiom that works better with a Py_RETURN macro than it > > would separate macros: > > > > Py_RETURN(cond ? Py_True : Py_False); > > And that would do what exactly? Duplicate the evaluation of the condition? You don't need to. #define Py_RETURN(x) do { \ PyObject *_tmp = (x); \ Py_INCREF(_tmp); \ return _tmp; \ } while(0) From barry at python.org Mon Aug 15 15:49:43 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 15 Aug 2011 09:49:43 -0400 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724.
In-Reply-To: <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> Message-ID: <20110815094943.35f640b9@resist.wooz.org> On Aug 15, 2011, at 05:46 AM, Raymond Hettinger wrote: >I don't think that is worth it. There is some value to keeping the API >consistent with the style that has been used in the past. So, I vote for >Py_RETURN_NOTIMPLEMENTED. There's no real need to factor this any further. >It's not hard and not important enough to introduce a new variation on return >macros. Adding another return style makes the C API harder to learn and >remember. If we we're starting from scratch, Py_RETURN(obj) would make >sense. But we're not starting from scratch, so we should stick with the >precedents. I can see the small value in the convenience, but I tend to agree with Raymond here. I think we have to be careful about not descending into macro obfuscation world. -Barry From solipsis at pitrou.net Mon Aug 15 15:59:00 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 15 Aug 2011 15:59:00 +0200 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> <20110815094943.35f640b9@resist.wooz.org> Message-ID: <20110815155900.046f932b@pitrou.net> On Mon, 15 Aug 2011 09:49:43 -0400 Barry Warsaw wrote: > On Aug 15, 2011, at 05:46 AM, Raymond Hettinger wrote: > > >I don't think that is worth it. There is some value to keeping the API > >consistent with the style that has been used in the past. So, I vote for > >Py_RETURN_NOTIMPLEMENTED. There's no real need to factor this any further. > >It's not hard and not important enough to introduce a new variation on return > >macros. Adding another return style makes the C API harder to learn and > >remember. If we we're starting from scratch, Py_RETURN(obj) would make > >sense. But we're not starting from scratch, so we should stick with the > >precedents. > > I can see the small value in the convenience, but I tend to agree with Raymond > here. I think we have to be careful about not descending into macro > obfuscation world. How is Py_RETURN(Py_NotImplemented) more obfuscated than Py_RETURN_NOTIMPLEMENTED ??? From petri at digip.org Mon Aug 15 19:48:42 2011 From: petri at digip.org (Petri Lehtinen) Date: Mon, 15 Aug 2011 20:48:42 +0300 Subject: [Python-Dev] Fwd: Mirroring Python repos to Bitbucket In-Reply-To: References: <4E42DF4A.8010407@atlassian.com> Message-ID: <20110815174842.GA1598@ihaa> Doug Hellmann wrote: > > Charles McLaughlin of Atlassian has set up mirrors of the Mercurial > repositories hosted on python.org as part of the ongoing > infrastructure improvement work. These mirrors will give us a public > fail-over repository in the event that hg.python.org goes offline > unexpectedly, and also provide features such as RSS feeds of changes > for users interested in monitoring the repository passively. As a side note, for those preferring git there's also a very unofficial git mirror at https://github.com/jonashaag/cpython. It uses hg-git for converting and syncs once a day. Petri From tjreedy at udel.edu Mon Aug 15 20:17:16 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 15 Aug 2011 14:17:16 -0400 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. 
In-Reply-To: <20110815094943.35f640b9@resist.wooz.org> References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> <20110815094943.35f640b9@resist.wooz.org> Message-ID: On 8/15/2011 9:49 AM, Barry Warsaw wrote: > On Aug 15, 2011, at 05:46 AM, Raymond Hettinger wrote: > >> I don't think that is worth it. There is some value to keeping the API >> consistent with the style that has been used in the past. So, I vote for >> Py_RETURN_NOTIMPLEMENTED. There's no real need to factor this any further. >> It's not hard and not important enough to introduce a new variation on return >> macros. Adding another return style makes the C API harder to learn and >> remember. If we we're starting from scratch, Py_RETURN(obj) would make >> sense. But we're not starting from scratch, so we should stick with the >> precedents. > > I can see the small value in the convenience, but I tend to agree with Raymond > here. I think we have to be careful about not descending into macro > obfuscation world. Coming fresh to the C-API, as I partly am, I would rather have exactly 1 generally useful macro that increments the refcount of an object and returns it. To me, multiple special-case, seldom-used macros are a better example of 'macro obfuscation'. -- Terry Jan Reedy From alexander.belopolsky at gmail.com Mon Aug 15 21:04:02 2011 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Mon, 15 Aug 2011 15:04:02 -0400 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: <20110811090242.1083782f@msiwind> References: <20110811090242.1083782f@msiwind> Message-ID: On Thu, Aug 11, 2011 at 3:02 AM, Antoine Pitrou wrote: .. >> ? Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. > > > It would sound more useful to have a generic Py_RETURN() macro rather than > some specific forms for each and every common object. Just my $0.02: I occasionally wish we had Py_RETURN_BOOL(1/0) instead of Py_RETURN_TRUE/FALSE, but I feel proposed Py_RETURN() is too ambiguous and should be called Py_RETURN_SINGLETON() or Py_RETURN_NEWREF(). Longer spelling, however makes it less attractive. Overall, I am -1 on Py_RETURN(). Introducing the second obvious way to spell Py_RETURN_NONE/TRUE/FALSE will clutter the API and novices may be misled into always using Py_RETURN(x) instead of return x attracting reference leaks. From alexandre at peadrop.com Mon Aug 15 21:56:14 2011 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Mon, 15 Aug 2011 12:56:14 -0700 Subject: [Python-Dev] PEP 3154 - pickle protocol 4 In-Reply-To: <20110812125846.00a75cd1@pitrou.net> References: <20110812125846.00a75cd1@pitrou.net> Message-ID: On Fri, Aug 12, 2011 at 3:58 AM, Antoine Pitrou wrote: > > Hello, > > This PEP is an attempt to foster a number of small incremental > improvements in a future pickle protocol version. The PEP process is > used in order to gather as many improvements as possible, because the > introduction of a new protocol version should be a rare occurrence. > > Feel free to suggest any additions. > > Your propositions sound all good to me. We will need to agree about the details, but I believe these improvements to the current protocol will be appreciated. Also, one thing keeps coming back is the need for pickling functions and methods which are not part of the global namespace (e.g. issue 9276). Support for this would likely help us fixing another related namespace issue (i.e., issue 3657 ). 
Finally, we currently missing support for pickling classes with __new__ taking keyword-only arguments (i.e. issue 4727 ). -- Alexandre -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Tue Aug 16 00:32:09 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 16 Aug 2011 08:32:09 +1000 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: <20110815155900.046f932b@pitrou.net> References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> <20110815094943.35f640b9@resist.wooz.org> <20110815155900.046f932b@pitrou.net> Message-ID: On Mon, Aug 15, 2011 at 11:59 PM, Antoine Pitrou wrote: > On Mon, 15 Aug 2011 09:49:43 -0400 > Barry Warsaw wrote: >> I can see the small value in the convenience, but I tend to agree with Raymond >> here. ?I think we have to be careful about not descending into macro >> obfuscation world. > > How is Py_RETURN(Py_NotImplemented) more obfuscated than > Py_RETURN_NOTIMPLEMENTED ??? Indeed, this entire discussion was started by the extension of the Py_RETURN_NONE idiom to also adopt Py_RETURN_NOTIMPLEMENTED. If the idiom is to be extended at all, why stop there? Why not cover the Py_RETURN_TRUE and Py_RETURN_FALSE cases as well? Or, we can add exactly one new macro that covers all 3 cases, and others besides. I haven't encountered any complaints about people failing to understand the difference between "return Py_None;" and "Py_RETURN_NONE;" and see no major reason why "return x;" vs "Py_RETURN(x);" would be significantly more confusing. Based on this thread, there are actually two options I'd be fine with: 1. Just revert it and leave Py_RETURN_NONE as a special snowflake 2. Properly generalise the incref-and-return idiom via a Py_RETURN macro Incrementally increasing complexity by adding a second instance of the dedicated macro approach is precisely what we *shouldn't* be doing. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From ncoghlan at gmail.com Tue Aug 16 00:37:11 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 16 Aug 2011 08:37:11 +1000 Subject: [Python-Dev] PEP 3154 - pickle protocol 4 In-Reply-To: References: <20110812125846.00a75cd1@pitrou.net> Message-ID: On Tue, Aug 16, 2011 at 5:56 AM, Alexandre Vassalotti wrote: > > On Fri, Aug 12, 2011 at 3:58 AM, Antoine Pitrou wrote: >> >> Hello, >> >> This PEP is an attempt to foster a number of small incremental >> improvements in a future pickle protocol version. The PEP process is >> used in order to gather as many improvements as possible, because the >> introduction of a new protocol version should be a rare occurrence. >> >> Feel free to suggest any additions. >> > > Your propositions sound all good to me. We will need to agree about the > details, but I believe these improvements to the current protocol will be > appreciated. > Also, one thing keeps coming back is the need for pickling functions and > methods which are not part of the global namespace (e.g. issue 9276). > Support for this would likely help us fixing another related namespace issue > (i.e.,?issue 3657). Finally, we currently missing support for pickling > classes with __new__ taking keyword-only arguments (i.e.?issue 4727). 
In the spirit of PEP 395 and python 3's pickle._compat_pickle, perhaps it would be worth looking at a mechanism whereby a pickle could specify "alternate class names" for included class instances in the pickle itself? Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From barry at python.org Tue Aug 16 00:43:22 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 15 Aug 2011 18:43:22 -0400 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> <20110815094943.35f640b9@resist.wooz.org> <20110815155900.046f932b@pitrou.net> Message-ID: <20110815184322.4a6df324@resist.wooz.org> On Aug 16, 2011, at 08:32 AM, Nick Coghlan wrote: >Indeed, this entire discussion was started by the extension of the >Py_RETURN_NONE idiom to also adopt Py_RETURN_NOTIMPLEMENTED. > >If the idiom is to be extended at all, why stop there? Why not cover >the Py_RETURN_TRUE and Py_RETURN_FALSE cases as well? > >Or, we can add exactly one new macro that covers all 3 cases, and >others besides. I haven't encountered any complaints about people >failing to understand the difference between "return Py_None;" and >"Py_RETURN_NONE;" and see no major reason why "return x;" vs >"Py_RETURN(x);" would be significantly more confusing. > >Based on this thread, there are actually two options I'd be fine with: >1. Just revert it and leave Py_RETURN_NONE as a special snowflake >2. Properly generalise the incref-and-return idiom via a Py_RETURN macro > >Incrementally increasing complexity by adding a second instance of the >dedicated macro approach is precisely what we *shouldn't* be doing. My problem with Py_RETURN(x) is that it's not clear that it also does an incref, and without that, I think it's *more* confusing to use rather than just writing it out explicitly, Py_RETURN_NONE's historic existence notwithstanding. So I'd opt for #1, unless we can agree on a better color for the bikeshed. -Barry From guido at python.org Tue Aug 16 00:52:00 2011 From: guido at python.org (Guido van Rossum) Date: Mon, 15 Aug 2011 15:52:00 -0700 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: <20110815184322.4a6df324@resist.wooz.org> References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> <20110815094943.35f640b9@resist.wooz.org> <20110815155900.046f932b@pitrou.net> <20110815184322.4a6df324@resist.wooz.org> Message-ID: On Mon, Aug 15, 2011 at 3:43 PM, Barry Warsaw wrote: > On Aug 16, 2011, at 08:32 AM, Nick Coghlan wrote: > >>Indeed, this entire discussion was started by the extension of the >>Py_RETURN_NONE idiom to also adopt Py_RETURN_NOTIMPLEMENTED. >> >>If the idiom is to be extended at all, why stop there? Why not cover >>the Py_RETURN_TRUE and Py_RETURN_FALSE cases as well? >> >>Or, we can add exactly one new macro that covers all 3 cases, and >>others besides. I haven't encountered any complaints about people >>failing to understand the difference between "return Py_None;" and >>"Py_RETURN_NONE;" and see no major reason why "return x;" vs >>"Py_RETURN(x);" would be significantly more confusing. >> >>Based on this thread, there are actually two options I'd be fine with: >>1. Just revert it and leave Py_RETURN_NONE as a special snowflake >>2. 
Properly generalise the incref-and-return idiom via a Py_RETURN macro >> >>Incrementally increasing complexity by adding a second instance of the >>dedicated macro approach is precisely what we *shouldn't* be doing. > > My problem with Py_RETURN(x) is that it's not clear that it also does an > incref, and without that, I think it's *more* confusing to use rather than > just writing it out explicitly, Py_RETURN_NONE's historic existence > notwithstanding. > > So I'd opt for #1, unless we can agree on a better color for the bikeshed. I dunno; if it *didn't* do an INCREF it would be a pretty pointless macro (just expanding to "return x") and I like reducing the clutter of a very common idiom. So I favor #2. -- --Guido van Rossum (python.org/~guido) From ethan at stoneleaf.us Tue Aug 16 01:13:50 2011 From: ethan at stoneleaf.us (Ethan Furman) Date: Mon, 15 Aug 2011 16:13:50 -0700 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: <20110815184322.4a6df324@resist.wooz.org> References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> <20110815094943.35f640b9@resist.wooz.org> <20110815155900.046f932b@pitrou.net> <20110815184322.4a6df324@resist.wooz.org> Message-ID: <4E49A82E.7080105@stoneleaf.us> Barry Warsaw wrote: > On Aug 16, 2011, at 08:32 AM, Nick Coghlan wrote: >> Based on this thread, there are actually two options I'd be fine with: >> 1. Just revert it and leave Py_RETURN_NONE as a special snowflake >> 2. Properly generalise the incref-and-return idiom via a Py_RETURN macro >> >> Incrementally increasing complexity by adding a second instance of the >> dedicated macro approach is precisely what we *shouldn't* be doing. > > My problem with Py_RETURN(x) is that it's not clear that it also does an > incref, and without that, I think it's *more* confusing to use rather than > just writing it out explicitly, Py_RETURN_NONE's historic existence > notwithstanding. > > So I'd opt for #1, unless we can agree on a better color for the bikeshed. My apologies if this is just noise, but are there RETURN macros that don't do an INCREF? ~Ethan~ From ncoghlan at gmail.com Tue Aug 16 01:39:09 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 16 Aug 2011 09:39:09 +1000 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: <4E49A82E.7080105@stoneleaf.us> References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> <20110815094943.35f640b9@resist.wooz.org> <20110815155900.046f932b@pitrou.net> <20110815184322.4a6df324@resist.wooz.org> <4E49A82E.7080105@stoneleaf.us> Message-ID: On Tue, Aug 16, 2011 at 9:13 AM, Ethan Furman wrote: > Barry Warsaw wrote: >> So I'd opt for #1, unless we can agree on a better color for the bikeshed. > > My apologies if this is just noise, but are there RETURN macros that don't > do an INCREF? 
No, Py_RETURN_NONE is the only previous example, and it was added to simplify the very common idiom of: Py_INCREF(Py_None); return Py_None; It was added originally because it helped to avoid *two* common bugs: return Py_None; # segfault waiting to happen return NULL; # Just plain wrong, but not picked up until tests are run and hence irritating I'd say NotImplemented is the second most common instance of that kind of direct incref-and-return (since operator methods need to return it to correctly support type coercion), although, as Antoine noted, Py_True and Py_False would be up there as well. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From guido at python.org Tue Aug 16 01:52:43 2011 From: guido at python.org (Guido van Rossum) Date: Mon, 15 Aug 2011 16:52:43 -0700 Subject: [Python-Dev] cpython: Add Py_RETURN_NOTIMPLEMENTED macro. Fixes #12724. In-Reply-To: References: <20110811090242.1083782f@msiwind> <1313410643.3557.2.camel@localhost.localdomain> <76C055DE-9433-41E9-A4B7-8ABDD433C29E@gmail.com> <20110815094943.35f640b9@resist.wooz.org> <20110815155900.046f932b@pitrou.net> <20110815184322.4a6df324@resist.wooz.org> <4E49A82E.7080105@stoneleaf.us> Message-ID: On Mon, Aug 15, 2011 at 4:39 PM, Nick Coghlan wrote: > On Tue, Aug 16, 2011 at 9:13 AM, Ethan Furman wrote: >> Barry Warsaw wrote: >>> So I'd opt for #1, unless we can agree on a better color for the bikeshed. >> >> My apologies if this is just noise, but are there RETURN macros that don't >> do an INCREF? > > No, Py_RETURN_NONE is the only previous example, and it was added to > simplify the very common idiom of: > > ? ?Py_INCREF(Py_None); > ? ?return Py_None; > > It was added originally because it helped to avoid *two* common bugs: > > ?return Py_None; # segfault waiting to happen > > ?return NULL; # Just plain wrong, but not picked up until tests are > run and hence irritating > > I'd say NotImplemented is the second most common instance of that kind > of direct incref-and-return (since operator methods need to return it > to correctly support type coercion), although, as Antoine noted, > Py_True and Py_False would be up there as well. I betcha if you extend your search to "return " preceded by "INCREF(variable)" you'll find a whole lot more examples. :-) -- --Guido van Rossum (python.org/~guido) From ncoghlan at gmail.com Tue Aug 16 04:35:48 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 16 Aug 2011 12:35:48 +1000 Subject: [Python-Dev] [Python-checkins] peps: Add Alexandre's suggestions In-Reply-To: References: Message-ID: On Tue, Aug 16, 2011 at 11:30 AM, antoine.pitrou wrote: > +Serializing "pseudo-global" objects > +----------------------------------- > + > +Objects which are not module-global, but should be treated in a similar > +fashion -- such as methods [4]_ or nested classes -- cannot currently be > +pickled (or, rather, unpickled) because the pickle protocol does not > +correctly specify how to retrieve them. ?One solution would be through the > +adjunction of a ``__namespace__`` (or ``__qualname__``) to all class and > +function objects, specifying the full "path" by which they can be retrieved. > +For globals, this would generally be ``"{}.{}".format(obj.__module__, obj.__name__)``. > +Then a new opcode can resolve that path and push the object on the stack, > +similarly to the GLOBAL opcode. 
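To make the quoted proposal a little more concrete, here is a sketch of the resolution step such a GLOBAL-like opcode would have to perform: import the module, then walk the dotted qualified name with getattr. The helper name and the module/path split are assumptions for illustration; this is not code from Modules/_pickle.c.

#include <Python.h>
#include <string.h>

static PyObject *
resolve_qualified_name(const char *module_name, const char *qualname)
{
    const char *start;
    PyObject *obj;

    obj = PyImport_ImportModule(module_name);
    if (obj == NULL)
        return NULL;

    /* Walk "Outer.Inner.method"-style paths one attribute at a time. */
    start = qualname;
    while (*start != '\0') {
        const char *end = strchr(start, '.');
        Py_ssize_t len = (end != NULL) ? (Py_ssize_t)(end - start)
                                       : (Py_ssize_t)strlen(start);
        PyObject *name, *next;

        name = PyUnicode_FromStringAndSize(start, len);
        if (name == NULL) {
            Py_DECREF(obj);
            return NULL;
        }
        next = PyObject_GetAttr(obj, name);     /* new reference */
        Py_DECREF(name);
        Py_DECREF(obj);
        if (next == NULL)
            return NULL;
        obj = next;
        start = (end != NULL) ? end + 1 : start + len;
    }
    return obj;     /* new reference to the "pseudo-global" object */
}

Recording just the module name and such a dotted path is also what makes a __qualname__-style attribute attractive for pickling: the pickler only has to store two strings.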
> + I think this is the part that ties in with the pickle-related aspects for PEP 395 - using '__qualname__' would be one way to align a module's real name with where it should be retrieved from and where it's documentation lives (I like 'qualified name' as a term, too). Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From solipsis at pitrou.net Tue Aug 16 11:25:29 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 16 Aug 2011 11:25:29 +0200 Subject: [Python-Dev] [Python-checkins] peps: Add Alexandre's suggestions References: Message-ID: <20110816112529.15fb6c69@pitrou.net> On Tue, 16 Aug 2011 12:35:48 +1000 Nick Coghlan wrote: > On Tue, Aug 16, 2011 at 11:30 AM, antoine.pitrou > wrote: > > +Serializing "pseudo-global" objects > > +----------------------------------- > > + > > +Objects which are not module-global, but should be treated in a similar > > +fashion -- such as methods [4]_ or nested classes -- cannot currently be > > +pickled (or, rather, unpickled) because the pickle protocol does not > > +correctly specify how to retrieve them. ?One solution would be through the > > +adjunction of a ``__namespace__`` (or ``__qualname__``) to all class and > > +function objects, specifying the full "path" by which they can be retrieved. > > +For globals, this would generally be ``"{}.{}".format(obj.__module__, obj.__name__)``. > > +Then a new opcode can resolve that path and push the object on the stack, > > +similarly to the GLOBAL opcode. > > + > > I think this is the part that ties in with the pickle-related aspects > for PEP 395 - using '__qualname__' would be one way to align a > module's real name with where it should be retrieved from and where > it's documentation lives (I like 'qualified name' as a term, too). Oops, I admit I hadn't read PEP 395. PEP 395 focuses on module aliasing, while the suggestion above focuses on the path of objects in modules. How can we reconcile the two? Do we want __qualname__ to be a relative "path" inside the module? (but then __qualname__ cannot specify its own module name). Regards Antoine. From ncoghlan at gmail.com Tue Aug 16 12:15:51 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 16 Aug 2011 20:15:51 +1000 Subject: [Python-Dev] [Python-checkins] peps: Add Alexandre's suggestions In-Reply-To: <20110816112529.15fb6c69@pitrou.net> References: <20110816112529.15fb6c69@pitrou.net> Message-ID: On Tue, Aug 16, 2011 at 7:25 PM, Antoine Pitrou wrote: > On Tue, 16 Aug 2011 12:35:48 +1000 > Nick Coghlan wrote: >> On Tue, Aug 16, 2011 at 11:30 AM, antoine.pitrou >> wrote: >> > +Serializing "pseudo-global" objects >> > +----------------------------------- >> > + >> > +Objects which are not module-global, but should be treated in a similar >> > +fashion -- such as methods [4]_ or nested classes -- cannot currently be >> > +pickled (or, rather, unpickled) because the pickle protocol does not >> > +correctly specify how to retrieve them. ?One solution would be through the >> > +adjunction of a ``__namespace__`` (or ``__qualname__``) to all class and >> > +function objects, specifying the full "path" by which they can be retrieved. >> > +For globals, this would generally be ``"{}.{}".format(obj.__module__, obj.__name__)``. >> > +Then a new opcode can resolve that path and push the object on the stack, >> > +similarly to the GLOBAL opcode. 
>> > + >> >> I think this is the part that ties in with the pickle-related aspects >> for PEP 395 - using '__qualname__' ?would be one way to align a >> module's real name with where it should be retrieved from and where >> it's documentation lives (I like 'qualified name' as a term, too). > > Oops, I admit I hadn't read PEP 395. > PEP 395 focuses on module aliasing, while the suggestion above focuses > on the path of objects in modules. How can we reconcile the two? Do we > want __qualname__ to be a relative "path" inside the module? > (but then __qualname__ cannot specify its own module name). I was more thinking that if pickle grew the ability to handle two different names for objects, then PEP 395 could run off the same feature without having to mess with sys.modules. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From solipsis at pitrou.net Tue Aug 16 13:23:44 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 16 Aug 2011 13:23:44 +0200 Subject: [Python-Dev] peps: Add Alexandre's suggestions In-Reply-To: References: <20110816112529.15fb6c69@pitrou.net> Message-ID: <20110816132344.6d64aca7@pitrou.net> On Tue, 16 Aug 2011 20:15:51 +1000 Nick Coghlan wrote: > > > > Oops, I admit I hadn't read PEP 395. > > PEP 395 focuses on module aliasing, while the suggestion above focuses > > on the path of objects in modules. How can we reconcile the two? Do we > > want __qualname__ to be a relative "path" inside the module? > > (but then __qualname__ cannot specify its own module name). > > I was more thinking that if pickle grew the ability to handle two > different names for objects, then PEP 395 could run off the same > feature without having to mess with sys.modules. But what happens if a module contains, say, a nested class with a __qualname__ (assigned by the interpreter) of "module_name.A.B", and the module later gets a __qualname__ (assigned by the user) of "module_alias"? Regards Antoine. From ncoghlan at gmail.com Tue Aug 16 13:37:31 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 16 Aug 2011 21:37:31 +1000 Subject: [Python-Dev] [Python-checkins] peps: Add Alexandre's suggestions In-Reply-To: <20110816132344.6d64aca7@pitrou.net> References: <20110816112529.15fb6c69@pitrou.net> <20110816132344.6d64aca7@pitrou.net> Message-ID: On Tue, Aug 16, 2011 at 9:23 PM, Antoine Pitrou wrote: > On Tue, 16 Aug 2011 20:15:51 +1000 > Nick Coghlan wrote: >> > >> > Oops, I admit I hadn't read PEP 395. >> > PEP 395 focuses on module aliasing, while the suggestion above focuses >> > on the path of objects in modules. How can we reconcile the two? Do we >> > want __qualname__ to be a relative "path" inside the module? >> > (but then __qualname__ cannot specify its own module name). >> >> I was more thinking that if pickle grew the ability to handle two >> different names for objects, then PEP 395 could run off the same >> feature without having to mess with sys.modules. > > But what happens if a module contains, say, a nested class with a > __qualname__ (assigned by the interpreter) of "module_name.A.B", and the > module later gets a __qualname__ (assigned by the user) of > "module_alias"? Yeah, I don't think it works with PEP 395 in its current state. But then, I'm not sure 395 will work at all in its current state - definitely a work in progress, that one. However, I'll definitely keep this aspect in mind next time I update it - even if they don't use the same mechanism, they should at least be compatible proposals. Cheers, Nick. -- Nick Coghlan?? |?? 
ncoghlan at gmail.com | Brisbane, Australia From chris at simplistix.co.uk Wed Aug 17 00:58:24 2011 From: chris at simplistix.co.uk (Chris Withers) Date: Tue, 16 Aug 2011 15:58:24 -0700 Subject: [Python-Dev] Sphinx version for Python 2.x docs Message-ID: <4E4AF610.5040303@simplistix.co.uk> Hi All, Any chance the version of sphinx used to generate the docs on docs.python.org could be updated? I'd love to take advantage of the "new format" intersphinx mapping: http://sphinx.pocoo.org/ext/intersphinx.html#confval-intersphinx_mapping ...but since it looks like docs.python.org uses a version of sphinx that's too old for that, I can't link to: :ref:`Foo ` ...and have to link to: `LogRecord attributes `__ instead :-S cheers, Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From sandro.tosi at gmail.com Wed Aug 17 01:05:58 2011 From: sandro.tosi at gmail.com (Sandro Tosi) Date: Wed, 17 Aug 2011 01:05:58 +0200 Subject: [Python-Dev] Sphinx version for Python 2.x docs In-Reply-To: <4E4AF610.5040303@simplistix.co.uk> References: <4E4AF610.5040303@simplistix.co.uk> Message-ID: Hello Chris, On Wed, Aug 17, 2011 at 00:58, Chris Withers wrote: > Hi All, > > Any chance the version of sphinx used to generate the docs on > docs.python.org could be updated? I think what's needed first is to run a pilot: take the current 2.7 doc, update sphinx and look at what breaks, and evaluate if it's fixable in a reasonable amount of time, or it's just too much and so on. Currently no-one has done that yet: would you ? :) That would help us quite a lot Cheers, -- Sandro Tosi (aka morph, morpheus, matrixhasu) My website: http://matrixhasu.altervista.org/ Me at Debian: http://wiki.debian.org/SandroTosi From chris at simplistix.co.uk Wed Aug 17 04:08:47 2011 From: chris at simplistix.co.uk (Chris Withers) Date: Tue, 16 Aug 2011 19:08:47 -0700 Subject: [Python-Dev] Sphinx version for Python 2.x docs In-Reply-To: References: <4E4AF610.5040303@simplistix.co.uk> Message-ID: <4E4B22AF.3090805@simplistix.co.uk> On 16/08/2011 16:05, Sandro Tosi wrote: > Hello Chris, > > On Wed, Aug 17, 2011 at 00:58, Chris Withers wrote: >> Hi All, >> >> Any chance the version of sphinx used to generate the docs on >> docs.python.org could be updated? > > I think what's needed first is to run a pilot: take the current 2.7 > doc, Where does that live? Where are the instructions for building the docs? (dependencies needed, etc) cheers, Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk From ncoghlan at gmail.com Wed Aug 17 05:59:07 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 17 Aug 2011 13:59:07 +1000 Subject: [Python-Dev] Sphinx version for Python 2.x docs In-Reply-To: <4E4B22AF.3090805@simplistix.co.uk> References: <4E4AF610.5040303@simplistix.co.uk> <4E4B22AF.3090805@simplistix.co.uk> Message-ID: On Wed, Aug 17, 2011 at 12:08 PM, Chris Withers wrote: > On 16/08/2011 16:05, Sandro Tosi wrote: >> I think what's needed first is to run a pilot: take the current 2.7 >> doc, > > Where does that live? > Where are the instructions for building the docs? (dependencies needed, etc) 'make html' in the Docs directory of a CPython checkout ("hg clone http://hg.python.org/cpython") usually does the trick. See http://docs.python.org/dev/documenting/building.html for more detail if the above doesn't work. Cheers, Nick. -- Nick Coghlan | 
Brisbane, Australia From sturla at molden.no Wed Aug 17 17:22:22 2011 From: sturla at molden.no (Sturla Molden) Date: Wed, 17 Aug 2011 17:22:22 +0200 Subject: [Python-Dev] GIL removal question In-Reply-To: References: <6474188B-CC1C-4B33-9E6C-9D2ACFC637D9@dabeaz.com> <9E289F07-B0DA-47E6-B46C-22FAE17D4A0D@dabeaz.com> Message-ID: <4E4BDCAE.7070202@molden.no> On 10.08.2011 13:43, Guido van Rossum wrote: > They have a specific plan, based on Software Transactional Memory: > http://morepypy.blogspot.com/2011/06/global-interpreter-lock-or-how-to-kill.html > Microsoft's experiment to use STM in .NET failed though. And Linux got rid of the BKL without STM. There is a similar but simpler paradigm called "bulk synchronous parallel" (BSP) which might work too. Threads work independently for a particular amount of time with private objects (e.g. copy-on-write memory), then enter a barrier, changes to global objects are synchronized and the GC collects garbage, after which worker threads leave the barrier, and the cycle repeats. To communicate changes to shared objects between synchronization barriers, Python code must use explicit locks and flush statements. But for the C code in the interpreter, BSP should give the same atomicity for Python bytecodes as the GIL (there is just one active thread inside the barrier). BSP is much simpler to implement than STM because of the barrier synchronization. BSP also cannot deadlock or livelock. And because threads in BSP work with private memory, there will be no thrashing (false sharing) from the reference counting GC. Sturla From vinay_sajip at yahoo.co.uk Thu Aug 18 00:30:39 2011 From: vinay_sajip at yahoo.co.uk (Vinay Sajip) Date: Wed, 17 Aug 2011 22:30:39 +0000 (UTC) Subject: [Python-Dev] Packaging in Python 2 anyone ? References: Message-ID: Tarek Ziadé gmail.com> writes: > IOW, the task to do is: > > 1/ copy packaging and all its stdlib dependencies in a standalone project > 2/ rename packaging to distutils2 > 3/ make it work under older 2.x and 3.x (2.x would be the priority) <==== > 4/ release it, promote its usage > 5/ consolidate the API with the feedback received > > I realize it's by far the less interesting task to do in packaging, > but it's by far one of the most important Okay, I had a bit of spare time today, and here's as far as I've got: Step 1 - done. Step 2 - done. Step 3 - On Python 2.6 most of the tests pass: Ran 322 tests in 12.148s FAILED (failures=3, errors=4, skipped=39) See the detailed test results (for Linux) at https://gist.github.com/1152791 The code is at https://bitbucket.org/vinay.sajip/du2/ stdlib dependency code is either moved to util.py or test/support.py as appropriate. You need to test in a virtualenv with unittest2 installed. No work has been done on packaging the project. I'm not sure if I'll have much more time to spend on this, but it's there in case someone else can look at the remaining test failures, plus Steps 4 and 5; hopefully I've broken the back of it :-) Regards, Vinay Sajip From chrism at plope.com Thu Aug 18 03:15:45 2011 From: chrism at plope.com (Chris McDonough) Date: Wed, 17 Aug 2011 21:15:45 -0400 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: References: Message-ID: <1313630145.3775.0.camel@thinko> I'll throw this out there.. why is it going to have a different name on python2 than on python3? - C On Wed, 2011-08-17 at 22:30 +0000, Vinay Sajip wrote: > Tarek Ziadé
gmail.com> writes: > > > IOW, the task to do is: > > > > 1/ copy packaging and all its stdlib dependencies in a standalone project > > 2/ rename packaging to distutils2 > > 3/ make it work under older 2.x and 3.x (2.x would be the priority) <==== > > 4/ release it, promote its usage > > 5/ consolidate the API with the feedback received > > > > I realize it's by far the less interesting task to do in packaging, > > but it's by far one of the most important > > Okay, I had a bit of spare time today, and here's as far as I've got: > > Step 1 - done. > Step 2 - done. > Step 3 - On Python 2.6 most of the tests pass: > > Ran 322 tests in 12.148s > > FAILED (failures=3, errors=4, skipped=39) > > See the detailed test results (for Linux) at https://gist.github.com/1152791 > > The code is at https://bitbucket.org/vinay.sajip/du2/ > > stdlib dependency code is either moved to util.py or test/support.py as > appropriate. You need to test in a virtualenv with unittest2 installed. No work > has been done on packaging the project. > > I'm not sure if I'll have much more time to spend on this, but it's there in > case someone else can look at the remaining test failures, plus Steps 4 and 5; > hopefully I've broken the back of it :-) > > Regards, > > Vinay Sajip > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%40plope.com From fdrake at acm.org Thu Aug 18 04:15:53 2011 From: fdrake at acm.org (Fred Drake) Date: Wed, 17 Aug 2011 22:15:53 -0400 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: <1313630145.3775.0.camel@thinko> References: <1313630145.3775.0.camel@thinko> Message-ID: On Wed, Aug 17, 2011 at 9:15 PM, Chris McDonough wrote: > I'll throw this out there.. why is it going to have a different name on > python2 than on python3? So it can be a drop-in replacement for the existing distutils2, I'd expect. "packaging" is new with Python3, and is the Guido-approved name. -Fred -- Fred L. Drake, Jr.? ? "A person who won't read has no advantage over one who can't read." ?? --Samuel Langhorne Clemens From ncoghlan at gmail.com Thu Aug 18 05:00:39 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 18 Aug 2011 13:00:39 +1000 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: References: <1313630145.3775.0.camel@thinko> Message-ID: On Thu, Aug 18, 2011 at 12:15 PM, Fred Drake wrote: > On Wed, Aug 17, 2011 at 9:15 PM, Chris McDonough wrote: >> I'll throw this out there.. why is it going to have a different name on >> python2 than on python3? > > So it can be a drop-in replacement for the existing distutils2, I'd expect. > > "packaging" is new with Python3, and is the Guido-approved name. It's actually for the same reason that unittest changes are backported under the unittest2 name - the distutils2 name can be used in the future to get Python 3.4 packaging features in Python 3.3, but that would be difficult if the backport shadowed the standard library name. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From fdrake at acm.org Thu Aug 18 05:17:38 2011 From: fdrake at acm.org (Fred Drake) Date: Wed, 17 Aug 2011 23:17:38 -0400 Subject: [Python-Dev] Packaging in Python 2 anyone ? 
In-Reply-To: References: <1313630145.3775.0.camel@thinko> Message-ID: On Wed, Aug 17, 2011 at 11:00 PM, Nick Coghlan wrote: > It's actually for the same reason that unittest changes are backported > under the unittest2 name - the distutils2 name can be used in the > future to get Python 3.4 packaging features in Python 3.3, but that > would be difficult if the backport shadowed the standard library name. Ah, yes... the old "too bad we stuck it in the standard library" problem. For some things, an easy lament, but for foundational packaging-related things, it's hard to get around. -Fred -- Fred L. Drake, Jr.? ? "A person who won't read has no advantage over one who can't read." ?? --Samuel Langhorne Clemens From ziade.tarek at gmail.com Thu Aug 18 09:32:45 2011 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Thu, 18 Aug 2011 09:32:45 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: References: <1313630145.3775.0.camel@thinko> Message-ID: On Thu, Aug 18, 2011 at 5:17 AM, Fred Drake wrote: > On Wed, Aug 17, 2011 at 11:00 PM, Nick Coghlan wrote: >> It's actually for the same reason that unittest changes are backported >> under the unittest2 name - the distutils2 name can be used in the >> future to get Python 3.4 packaging features in Python 3.3, but that >> would be difficult if the backport shadowed the standard library name. > > Ah, yes... the old "too bad we stuck it in the standard library" problem. > > For some things, an easy lament, but for foundational packaging-related > things, it's hard to get around. Yeah exactly. And the good thing about packaging and distutils2 is that for the regular usage (package your project) you don't type any code, just define options in setup.cfg. IOW there's no "import packaging" or "import distutils2". Cheers Tarek > > > ?-Fred > > -- > Fred L. Drake, Jr.? ? > "A person who won't read has no advantage over one who can't read." > ?? --Samuel Langhorne Clemens > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/ziade.tarek%40gmail.com > -- Tarek Ziad? | http://ziade.org From ziade.tarek at gmail.com Thu Aug 18 09:35:01 2011 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Thu, 18 Aug 2011 09:35:01 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: References: Message-ID: On Thu, Aug 18, 2011 at 12:30 AM, Vinay Sajip wrote: ... > Okay, I had a bit of spare time today, and here's as far as I've got: Awesome, thanks a lot ! > > Step 1 - done. > Step 2 - done. > Step 3 - On Python 2.6 most of the tests pass: > > Ran 322 tests in 12.148s > > FAILED (failures=3, errors=4, skipped=39) > > See the detailed test results (for Linux) at https://gist.github.com/1152791 > > The code is at https://bitbucket.org/vinay.sajip/du2/ > > stdlib dependency code is either moved to util.py or test/support.py as > appropriate. You need to test in a virtualenv with unittest2 installed. No work > has been done on packaging the project. > > I'm not sure if I'll have much more time to spend on this, but it's there in > case someone else can look at the remaining test failures, plus Steps 4 and 5; > hopefully I've broken the back of it :-) Thank you very much ! Ideally, if you could push this to hg.python.org/distutils2 (overwriting the existing stuff). Cheers Tarek -- Tarek Ziad? 
| http://ziade.org From vinay_sajip at yahoo.co.uk Thu Aug 18 11:16:21 2011 From: vinay_sajip at yahoo.co.uk (Vinay Sajip) Date: Thu, 18 Aug 2011 09:16:21 +0000 (UTC) Subject: [Python-Dev] Packaging in Python 2 anyone ? References: Message-ID: Tarek Ziad? gmail.com> writes: > Ideally, if you could push this to hg.python.org/distutils2 > (overwriting the existing stuff). Okay, done. I've overwritten existing files and added new ones, only removing/renaming things like index -> pypi and mkcfg -> create. I haven't touched existing code e.g. the top-level test scripts or the _backport directory. The added test_distutils2.py is what I used to run the tests. Regards, Vinay Sajip From solipsis at pitrou.net Thu Aug 18 11:26:40 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 18 Aug 2011 11:26:40 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? References: Message-ID: <20110818112640.3cfa1455@pitrou.net> On Thu, 18 Aug 2011 09:16:21 +0000 (UTC) Vinay Sajip wrote: > Tarek Ziad? gmail.com> writes: > > > Ideally, if you could push this to hg.python.org/distutils2 > > (overwriting the existing stuff). > > Okay, done. I've overwritten existing files and added new ones, only > removing/renaming things like index -> pypi and mkcfg -> create. I haven't > touched existing code e.g. the top-level test scripts or the _backport > directory. The added test_distutils2.py is what I used to run the tests. Impressive work! That said, I'm not sure it was the best moment to backport, since test_packaging currently fails under Windows (I think ?ric is supposed to look at it). Regards Antoine. From ziade.tarek at gmail.com Thu Aug 18 11:37:32 2011 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Thu, 18 Aug 2011 11:37:32 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: <20110818112640.3cfa1455@pitrou.net> References: <20110818112640.3cfa1455@pitrou.net> Message-ID: On Thu, Aug 18, 2011 at 11:26 AM, Antoine Pitrou wrote: > On Thu, 18 Aug 2011 09:16:21 +0000 (UTC) > Vinay Sajip wrote: >> Tarek Ziad? gmail.com> writes: >> >> > Ideally, if you could push this to hg.python.org/distutils2 >> > (overwriting the existing stuff). >> >> Okay, done. I've overwritten existing files and added new ones, only >> removing/renaming things like index -> pypi and mkcfg -> create. I haven't >> touched existing code e.g. the top-level test scripts or the _backport >> directory. The added test_distutils2.py is what I used to run the tests. > > Impressive work! > That said, I'm not sure it was the best moment to backport, since > test_packaging currently fails under Windows (I think ?ric is supposed > to look at it). > Frankly, I think there's no best moment for this. We'll need to backport everything we do in packaging/ in distutils2/ (Yeah, painful...) > Regards > > Antoine. > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/ziade.tarek%40gmail.com > -- Tarek Ziad? | http://ziade.org From ziade.tarek at gmail.com Thu Aug 18 11:38:12 2011 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Thu, 18 Aug 2011 11:38:12 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: References: Message-ID: On Thu, Aug 18, 2011 at 11:16 AM, Vinay Sajip wrote: > Tarek Ziad? 
gmail.com> writes: > >> Ideally, if you could push this to hg.python.org/distutils2 >> (overwriting the existing stuff). > > Okay, done. I've overwritten existing files and added new ones, only > removing/renaming things like index -> pypi and mkcfg -> create. I haven't > touched existing code e.g. the top-level test scripts or the _backport > directory. The added test_distutils2.py is what I used to run the tests. Thanks again > Regards, > > Vinay Sajip > > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/ziade.tarek%40gmail.com > -- Tarek Ziad? | http://ziade.org From vinay_sajip at yahoo.co.uk Thu Aug 18 18:19:12 2011 From: vinay_sajip at yahoo.co.uk (Vinay Sajip) Date: Thu, 18 Aug 2011 16:19:12 +0000 (UTC) Subject: [Python-Dev] Packaging in Python 2 anyone ? References: <20110818112640.3cfa1455@pitrou.net> Message-ID: Antoine Pitrou pitrou.net> writes: > That said, I'm not sure it was the best moment to backport, since > test_packaging currently fails under Windows (I think ?ric is supposed > to look at it). Plus, there are at least half a dozen issues which would need to be addressed in packaging before final release, but they are not complete show-stoppers and won't preclude 2.x users giving useful feedback. Regards, Vinay Sajip From stefan at bytereef.org Thu Aug 18 18:22:54 2011 From: stefan at bytereef.org (Stefan Krah) Date: Thu, 18 Aug 2011 18:22:54 +0200 Subject: [Python-Dev] memoryview: "B", "c", "b" format specifiers Message-ID: <20110818162254.GA18925@sleipnir.bytereef.org> Hello, during my work on PEP-3118 fixes I noticed that memoryview does not handle the "B" format specifier according to the struct module documentation: Here's what struct does: >>> b = bytearray([1,2,3]) >>> struct.pack_into('B', b, 0, b'X') Traceback (most recent call last): File "", line 1, in struct.error: required argument is not an integer >>> struct.pack_into('c', b, 0, b'X') >>> b bytearray(b'X\x02\x03') Here's what memoryview does: >>> b = bytearray([1,2,3]) >>> m = memoryview(b) >>> m.format 'B' >>> m[0] = b'X' >>> m[0] = 3 Traceback (most recent call last): File "", line 1, in TypeError: 'int' does not support the buffer interface So, memoryview does exactly the opposite of what is specified. It should reject the bytes object but accept the integer. I would like to fix this in the features/pep-3118 repository as follows: - memoryview should respect the format specifiers. - bytearray and friends should set the format specifier to "c" in their getbuffer() methods. - Introduce a new function PyMemoryView_FromBytes() that can be used instead of PyMemoryView_FromBuffer(). PyMemoryView_FromBuffer() is usually used in conjunction with PyBuffer_FillInfo(), which sets the format specifier to "B". Are there any general objections to this? Stefan Krah From solipsis at pitrou.net Thu Aug 18 18:40:40 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 18 Aug 2011 18:40:40 +0200 Subject: [Python-Dev] memoryview: "B", "c", "b" format specifiers References: <20110818162254.GA18925@sleipnir.bytereef.org> Message-ID: <20110818184040.26dcfcee@pitrou.net> On Thu, 18 Aug 2011 18:22:54 +0200 Stefan Krah wrote: > > So, memoryview does exactly the opposite of what is specified. It should > reject the bytes object but accept the integer. Well, memoryview is quite dumb right now. 
It ignores the format and just considers its underlying memory a bytes sequence. > I would like to fix this in the features/pep-3118 repository as follows: > > - memoryview should respect the format specifiers. > > - bytearray and friends should set the format specifier to "c" > in their getbuffer() methods. > > - Introduce a new function PyMemoryView_FromBytes() that can be used > instead of PyMemoryView_FromBuffer(). PyMemoryView_FromBuffer() > is usually used in conjunction with PyBuffer_FillInfo(), which > sets the format specifier to "B". What would PyMemoryView_FromBytes() do? The name suggests it takes a bytes object, but you can already use PyMemoryView_FromObject() for that. (I personnaly think the general bytes-as-sequence-of-ints behaviour is a mistake, so I wouldn't care much about an additional C API to enforce that behaviour :-)) Regards Antoine. From stefan at bytereef.org Thu Aug 18 18:57:00 2011 From: stefan at bytereef.org (Stefan Krah) Date: Thu, 18 Aug 2011 18:57:00 +0200 Subject: [Python-Dev] memoryview: "B", "c", "b" format specifiers In-Reply-To: <20110818184040.26dcfcee@pitrou.net> References: <20110818162254.GA18925@sleipnir.bytereef.org> <20110818184040.26dcfcee@pitrou.net> Message-ID: <20110818165700.GA19118@sleipnir.bytereef.org> Antoine Pitrou wrote: > > I would like to fix this in the features/pep-3118 repository as follows: > > > > - memoryview should respect the format specifiers. > > > > - bytearray and friends should set the format specifier to "c" > > in their getbuffer() methods. > > > > - Introduce a new function PyMemoryView_FromBytes() that can be used > > instead of PyMemoryView_FromBuffer(). PyMemoryView_FromBuffer() > > is usually used in conjunction with PyBuffer_FillInfo(), which > > sets the format specifier to "B". > > What would PyMemoryView_FromBytes() do? The name suggests it takes a > bytes object, but you can already use PyMemoryView_FromObject() for > that. Oh no, the name isn't quite right then. It should be a replacement for the combination PyBuffer_FillInfo()/PyMemoryView_FromBuffer() and it should temporarily wrap a C-string. Also, unlike that combination, it would set the format specifier to "c". Perhaps this name is better: PyObject * PyMemoryView_FromCString(char *s, Py_ssize_t size, int flags); 'flags' is just PyBUF_READ or PyBUF_WRITE. In the Python source tree, it could completely replace PyBuffer_FillInfo() and PyMemoryView_FromBuffer(). Stefan Krah From fijall at gmail.com Thu Aug 18 19:31:06 2011 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 18 Aug 2011 19:31:06 +0200 Subject: [Python-Dev] PyPy 1.6 released Message-ID: ======================== PyPy 1.6 - kickass panda ======================== We're pleased to announce the 1.6 release of PyPy. This release brings a lot of bugfixes and performance improvements over 1.5, and improves support for Windows 32bit and OS X 64bit. This version fully implements Python 2.7.1 and has beta level support for loading CPython C extensions. You can download it here: http://pypy.org/download.html What is PyPy? ============= PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7.1. It's fast (`pypy 1.5 and cpython 2.6.2`_ performance comparison) due to its integrated tracing JIT compiler. This release supports x86 machines running Linux 32/64 or Mac OS X. Windows 32 is beta (it roughly works but a lot of small issues have not been fixed so far). Windows 64 is not yet supported. 
The main topics of this release are speed and stability: on average on our benchmark suite, PyPy 1.6 is between **20% and 30%** faster than PyPy 1.5, which was already much faster than CPython on our set of benchmarks. The speed improvements have been made possible by optimizing many of the layers which compose PyPy. In particular, we improved: the Garbage Collector, the JIT warmup time, the optimizations performed by the JIT, the quality of the generated machine code and the implementation of our Python interpreter. .. _`pypy 1.5 and cpython 2.6.2`: http://speed.pypy.org Highlights ========== * Numerous performance improvements, overall giving considerable speedups: - better GC behavior when dealing with very large objects and arrays - **fast ctypes:** now calls to ctypes functions are seen and optimized by the JIT, and they are up to 60 times faster than PyPy 1.5 and 10 times faster than CPython - improved generators(1): simple generators now are inlined into the caller loop, making performance up to 3.5 times faster than PyPy 1.5. - improved generators(2): thanks to other optimizations, even generators that are not inlined are between 10% and 20% faster than PyPy 1.5. - faster warmup time for the JIT - JIT support for single floats (e.g., for ``array('f')``) - optimized dictionaries: the internal representation of dictionaries is now dynamically selected depending on the type of stored objects, resulting in faster code and smaller memory footprint. For example, dictionaries whose keys are all strings, or all integers. Other dictionaries are also smaller due to bugfixes. * JitViewer: this is the first official release which includes the JitViewer, a web-based tool which helps you to see which parts of your Python code have been compiled by the JIT, down until the assembler. The `jitviewer`_ 0.1 has already been release and works well with PyPy 1.6. * The CPython extension module API has been improved and now supports many more extensions. For information on which one are supported, please refer to our `compatibility wiki`_. * Multibyte encoding support: this was of of the last areas in which we were still behind CPython, but now we fully support them. * Preliminary support for NumPy: this release includes a preview of a very fast NumPy module integrated with the PyPy JIT. Unfortunately, this does not mean that you can expect to take an existing NumPy program and run it on PyPy, because the module is still unfinished and supports only some of the numpy API. However, barring some details, what works should be blazingly fast :-) * Bugfixes: since the 1.5 release we fixed 53 bugs in our `bug tracker`_, not counting the numerous bugs that were found and reported through other channels than the bug tracker. Cheers, Hakan Ardo, Carl Friedrich Bolz, Laura Creighton, Antonio Cuni, Maciej Fijalkowski, Amaury Forgeot d'Arc, Alex Gaynor, Armin Rigo and the PyPy team .. _`jitviewer`: http://morepypy.blogspot.com/2011/08/visualization-of-jitted-code.html .. _`bug tracker`: https://bugs.pypy.org .. _`compatibility wiki`: https://bitbucket.org/pypy/compatibility/wiki/Home From merwok at netwok.org Thu Aug 18 20:02:45 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Thu, 18 Aug 2011 20:02:45 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? 
In-Reply-To: References: <20110818112640.3cfa1455@pitrou.net> Message-ID: <4E4D53C5.8000608@netwok.org> Le 18/08/2011 18:19, Vinay Sajip a ?crit : > Antoine Pitrou pitrou.net> writes: >> That said, I'm not sure it was the best moment to backport, since >> test_packaging currently fails under Windows (I think ?ric is supposed >> to look at it). I will; any help is welcome, especially if you have a machine with the same Windows version (see #12678). I caught Georg?s message on python-committers but could not do anything in time; I only have Internet access at a public library so I can?t be as responsive as I would. > Plus, there are at least half a dozen issues which would need to be addressed in > packaging before final release, but they are not complete show-stoppers and > won't preclude 2.x users giving useful feedback. Yes, there are a few dozen bugs that need addressing before 1.0 (i.e. Python 3.3), but there?s time. Alpha and beta releases of distutils2 would be useful. Regards From solipsis at pitrou.net Thu Aug 18 20:19:16 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 18 Aug 2011 20:19:16 +0200 Subject: [Python-Dev] cpython (3.2): NUL -> NULL References: Message-ID: <20110818201916.35218b7e@pitrou.net> On Thu, 18 Aug 2011 17:49:28 +0200 benjamin.peterson wrote: > - PyErr_SetString(PyExc_TypeError, "embedded NUL character"); > + PyErr_SetString(PyExc_TypeError, "embedded NULL character"); Are you sure? IIRC, NUL is the little name of ASCII character 0 (while NULL would be the NULL pointer). Regards Antoine. From eric at trueblade.com Thu Aug 18 20:25:36 2011 From: eric at trueblade.com (Eric V. Smith) Date: Thu, 18 Aug 2011 14:25:36 -0400 Subject: [Python-Dev] cpython (3.2): NUL -> NULL In-Reply-To: <20110818201916.35218b7e@pitrou.net> References: <20110818201916.35218b7e@pitrou.net> Message-ID: <4E4D5920.4080408@trueblade.com> On 08/18/2011 02:19 PM, Antoine Pitrou wrote: > On Thu, 18 Aug 2011 17:49:28 +0200 > benjamin.peterson wrote: >> - PyErr_SetString(PyExc_TypeError, "embedded NUL character"); >> + PyErr_SetString(PyExc_TypeError, "embedded NULL character"); > > Are you sure? IIRC, NUL is the little name of ASCII character 0 > (while NULL would be the NULL pointer). That's my understanding, too. Eric. From merwok at netwok.org Thu Aug 18 20:27:30 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Thu, 18 Aug 2011 20:27:30 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: References: Message-ID: <4E4D5992.7070603@netwok.org> Hi Tarek, > Doing an automated conversion turned out to be a nightmare, and I was > about to go ahead and maintain a fork of the packaging package, with > the few modules that are needed (sysconfig, etc) within a standalone > release. Can you give us more info? Do you have a repo somewhere, or notes? A related question: what is the minimum 2.x version that we should support? 2.6 would be a dream, thanks to bytes literal and all that, but I?m sure it?s not realistic; 2.5 would be nice for the with statement and hashlib, otherwise 2.4 is okay. When I talked with ?ukasz in private email about backports and 3to2, we agreed that there were some serious bugs in 3to2 and we wanted to work on patches. I also wanted to make the command-line driver more flexible, so that it would be easy to run a command to apply only 3.3?3.2 fixes, then another for 3.2?2.7, etc. Maybe your problems were caused by the state of the packaging codebase. 
The conversion to 3.x was a little messy: in some cases there were parallel code branches for 2.x and 3.x, on other cases 2to3 was run, and in many cases the conversion had to be cleaned up (esp. bytes/str madness). Even now that the code runs and the tests pass, there may still be things in need of a cleanup in the codebase, and maybe they trip up 3to2. > I am looking for someone that has some free time and that is willing > to lead this work. Well, free time is scarce with all these distutils bugs on my plate, but I am definitely interested in heading the backport, as I stated earlier. I think the key point is to avoid making the same work over and over again, and I see a few ways of managing that. The first way is to start with a 2.x-converted codebase (thanks Vinay!) and manually port all cpython/packaging changesets to distutils2, like I used to do. This is just as annoying as backporting to 2.7, and just as simple. The second way is to work on a conversion tool instead of working on changesets. The idea is to make a robust tool based on 3to2 that copies code and converts it. This would not be the easiest way, as shown by your experience, but surely the less cumbersome in the long term. The third way is to use a new Mercurial repo converted from the cpython repo, so that we can run ?hg convert? again to pull new changesets. Convert, test and commit. The advantage is that it?s not required to port each changeset: the convert-merge dance can be done once a month, or just for new releases. The fourth way is hybrid: start from a 2.x-converted codebase, and each month, make a diff for cpython/Lib/packaging and apply to distutils2. I fear that such diffs would be painful to apply, and consist mostly of rejects. With idea #3, we get to use a merge tool, which is much better. After writing out these ideas, I think the first one is certainly the simplest thing that could work with minimum pain. Le 18/08/2011 00:30, Vinay Sajip a ?crit : > stdlib dependency code is either moved to util.py or test/support.py as > appropriate. We need sysconfig, shutil, tarfile, hashlib... Surely that?s a lot to put in util.py. > I'm not sure if I'll have much more time to spend on this, but it's there in > case someone else can look at the remaining test failures, plus Steps 4 and 5; > hopefully I've broken the back of it :-) I join my thanks to Tarek?s, and volunteer to follow on :) Regards From benjamin at python.org Thu Aug 18 20:41:15 2011 From: benjamin at python.org (Benjamin Peterson) Date: Thu, 18 Aug 2011 13:41:15 -0500 Subject: [Python-Dev] cpython (3.2): NUL -> NULL In-Reply-To: <20110818201916.35218b7e@pitrou.net> References: <20110818201916.35218b7e@pitrou.net> Message-ID: 2011/8/18 Antoine Pitrou : > On Thu, 18 Aug 2011 17:49:28 +0200 > benjamin.peterson wrote: >> - ? ? ? ?PyErr_SetString(PyExc_TypeError, "embedded NUL character"); >> + ? ? ? ?PyErr_SetString(PyExc_TypeError, "embedded NULL character"); > > Are you sure? IIRC, NUL is the little name of ASCII character 0 > (while NULL would be the NULL pointer). NUL is the abbreviation of the "Null character". 
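For concreteness, a tiny illustrative C snippet (not part of the patch being discussed) separating the two notions:

    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        char nul = '\0';   /* the null character (ASCII NUL), integer value 0 */
        char *ptr = NULL;  /* the null pointer constant */
        printf("nul == 0: %d, ptr == NULL: %d\n", nul == 0, ptr == NULL);
        return 0;
    }

The error message is about the character, not the pointer, so the standard's lower-case "null character" wording would also be accurate.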
-- Regards, Benjamin From stefan at bytereef.org Thu Aug 18 20:51:21 2011 From: stefan at bytereef.org (Stefan Krah) Date: Thu, 18 Aug 2011 20:51:21 +0200 Subject: [Python-Dev] cpython (3.2): NUL -> NULL In-Reply-To: <20110818201916.35218b7e@pitrou.net> References: <20110818201916.35218b7e@pitrou.net> Message-ID: <20110818185121.GA19783@sleipnir.bytereef.org> Antoine Pitrou wrote: > On Thu, 18 Aug 2011 17:49:28 +0200 > benjamin.peterson wrote: > > - PyErr_SetString(PyExc_TypeError, "embedded NUL character"); > > + PyErr_SetString(PyExc_TypeError, "embedded NULL character"); > > Are you sure? IIRC, NUL is the little name of ASCII character 0 > (while NULL would be the NULL pointer). Yes, that's the traditional name. I was surprised that the C99 standard uses "null character" in almost all cases. Example: "The construction '\0' is commonly used to represent the null character." So I think it should be either NUL or "null character" with the lower case spelling. Stefan Krah From ziade.tarek at gmail.com Thu Aug 18 20:59:00 2011 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Thu, 18 Aug 2011 20:59:00 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: <4E4D5992.7070603@netwok.org> References: <4E4D5992.7070603@netwok.org> Message-ID: On Thu, Aug 18, 2011 at 8:27 PM, ?ric Araujo wrote: > Hi Tarek, > >> Doing an automated conversion turned out to be a nightmare, and I was >> about to go ahead and maintain a fork of the packaging package, with >> the few modules that are needed (sysconfig, etc) within a standalone >> release. > > Can you give us more info? ?Do you have a repo somewhere, or notes? I tried using relative imports, but that made the whole thing complicated and not working under older 2.x then there are a lot of spots where the word 'packaging' is used for other things than modules. then there are spots when we needed to change the bytes/str behavior depending on the py version, making everything complex to maintain I guess it's the addition of the three that made it too complex : transparent renaming + 3to2 + 3.xto3.x > > A related question: what is the minimum 2.x version that we should > support? ?2.6 would be a dream, thanks to bytes literal and all that, > but I?m sure it?s not realistic; 2.5 would be nice for the with > statement and hashlib, otherwise 2.4 is okay. 2.5 sounds good. I am sold on dropping 2.4 frankly. Maybe we can drop 2.5 in a few months ;) > > When I talked with ?ukasz in private email about backports and 3to2, we > agreed that there were some serious bugs in 3to2 and we wanted to work > on patches. ?I also wanted to make the command-line driver more > flexible, so that it would be easy to run a command to apply only > 3.3?3.2 fixes, then another for 3.2?2.7, etc. > > Maybe your problems were caused by the state of the packaging codebase. > ?The conversion to 3.x was a little messy: in some cases there were > parallel code branches for 2.x and 3.x, on other cases 2to3 was run, and > in many cases the conversion had to be cleaned up (esp. bytes/str > madness). ?Even now that the code runs and the tests pass, there may > still be things in need of a cleanup in the codebase, and maybe they > trip up 3to2. I think that's not worth the effort frankly. keeping a clean fully py3 code without worrying about making it 3to2 friendly, make all contributors life easier ihmo. The tradeoff is that we will have to backport to distutils2 changes. 
That's what was done for a while between the Python trunk and the Py3k branch, so I guess it's doable -- if all packaging contributors agree to do this backport work. > >> I am looking for someone that has some free time and that is willing >> to lead this work. > > Well, free time is scarce with all these distutils bugs on my plate, but > I am definitely interested in heading the backport, as I stated earlier. > ?I think the key point is to avoid making the same work over and over > again, and I see a few ways of managing that. > > The first way is to start with a 2.x-converted codebase (thanks Vinay!) > and manually port all cpython/packaging changesets to distutils2, like I > used to do. ?This is just as annoying as backporting to 2.7, and just as > simple. > > The second way is to work on a conversion tool instead of working on > changesets. ?The idea is to make a robust tool based on 3to2 that copies > code and converts it. ?This would not be the easiest way, as shown by > your experience, but surely the less cumbersome in the long term. > > The third way is to use a new Mercurial repo converted from the cpython > repo, so that we can run ?hg convert? again to pull new changesets. > Convert, test and commit. ?The advantage is that it?s not required to > port each changeset: the convert-merge dance can be done once a month, > or just for new releases. > > The fourth way is hybrid: start from a 2.x-converted codebase, and each > month, make a diff for cpython/Lib/packaging and apply to distutils2. ?I > fear that such diffs would be painful to apply, and consist mostly of > rejects. ?With idea #3, we get to use a merge tool, which is much better. > > After writing out these ideas, I think the first one is certainly the > simplest thing that could work with minimum pain. I think so too. The automatic conversion sounded like a great thing, but the nature of the project makes it too hard, Cheers -- Tarek Ziad? | http://ziade.org From stefan at bytereef.org Thu Aug 18 22:25:59 2011 From: stefan at bytereef.org (Stefan Krah) Date: Thu, 18 Aug 2011 22:25:59 +0200 Subject: [Python-Dev] memoryview: "B", "c", "b" format specifiers In-Reply-To: <20110818184040.26dcfcee@pitrou.net> References: <20110818162254.GA18925@sleipnir.bytereef.org> <20110818184040.26dcfcee@pitrou.net> Message-ID: <20110818202559.GA20296@sleipnir.bytereef.org> Antoine Pitrou wrote: > (I personnaly think the general bytes-as-sequence-of-ints behaviour is > a mistake, so I wouldn't care much about an additional C API to enforce > that behaviour :-)) I don't want to abolish the "c" (bytes of length 1) format. :) I think there are use cases for well defined arrays of small signed/unsigned integers. Say you want to send a log-ngram array of unsigned chars over the network. There shouldn't be a bytes object involved in that process. You would pack the array with ints and unpack as ints. Unless the struct module and PEP-3118 grow support for int8_t and uint8_t, I think "b" and "B" should probably be restricted to integers. Stefan Krah From solipsis at pitrou.net Thu Aug 18 22:53:06 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 18 Aug 2011 22:53:06 +0200 Subject: [Python-Dev] memoryview: "B", "c", "b" format specifiers References: <20110818162254.GA18925@sleipnir.bytereef.org> <20110818184040.26dcfcee@pitrou.net> <20110818165700.GA19118@sleipnir.bytereef.org> Message-ID: <20110818225306.76d11bfe@pitrou.net> On Thu, 18 Aug 2011 18:57:00 +0200 Stefan Krah wrote: > > Oh no, the name isn't quite right then. 
It should be a replacement > for the combination PyBuffer_FillInfo()/PyMemoryView_FromBuffer() > and it should temporarily wrap a C-string. Ah, nice. > PyObject * PyMemoryView_FromCString(char *s, Py_ssize_t size, int flags); It's not really a C string, since it's not null-terminated. PyMemoryView_FromMemory? (that would mirror PyUnicode_FromUnicode, for example) > 'flags' is just PyBUF_READ or PyBUF_WRITE. Why do we have these in addition to PyBUF_WRITABLE already? Regards Antoine. From vinay_sajip at yahoo.co.uk Thu Aug 18 23:49:52 2011 From: vinay_sajip at yahoo.co.uk (Vinay Sajip) Date: Thu, 18 Aug 2011 21:49:52 +0000 (UTC) Subject: [Python-Dev] Packaging in Python 2 anyone ? References: <4E4D5992.7070603@netwok.org> Message-ID: ?ric Araujo netwok.org> writes: > Le 18/08/2011 00:30, Vinay Sajip a ?crit : > > stdlib dependency code is either moved to util.py or test/support.py as > > appropriate. > We need sysconfig, shutil, tarfile, hashlib... Surely that?s a lot to > put in util.py. Well sysconfig.py/sysconfig.cfg have been copied as is. I've only copied over specific things we need from shutil/functools/os, etc. so far to util.py. I haven't looked at 2.4/2.5 support yet: things like hashlib would probably need to be treated the same way Django handles this sort of backport of functionality. > I join my thanks to Tarek?s, and volunteer to follow on :) That's good news :-) Regards, Vinay Sajip From stefan at bytereef.org Fri Aug 19 00:30:46 2011 From: stefan at bytereef.org (Stefan Krah) Date: Fri, 19 Aug 2011 00:30:46 +0200 Subject: [Python-Dev] memoryview: "B", "c", "b" format specifiers In-Reply-To: <20110818225306.76d11bfe@pitrou.net> References: <20110818162254.GA18925@sleipnir.bytereef.org> <20110818184040.26dcfcee@pitrou.net> <20110818165700.GA19118@sleipnir.bytereef.org> <20110818225306.76d11bfe@pitrou.net> Message-ID: <20110818223046.GA20738@sleipnir.bytereef.org> Antoine Pitrou wrote: > On Thu, 18 Aug 2011 18:57:00 +0200 > Stefan Krah wrote: > > > > Oh no, the name isn't quite right then. It should be a replacement > > for the combination PyBuffer_FillInfo()/PyMemoryView_FromBuffer() > > and it should temporarily wrap a C-string. > > Ah, nice. > > > PyObject * PyMemoryView_FromCString(char *s, Py_ssize_t size, int flags); > > It's not really a C string, since it's not null-terminated. > PyMemoryView_FromMemory? > > (that would mirror PyUnicode_FromUnicode, for example) I see, yes. PyMemoryView_FromStringAndSize()? No, too much typing. I prefer PyMemoryView_FromMemory(). > > 'flags' is just PyBUF_READ or PyBUF_WRITE. > > Why do we have these in addition to PyBUF_WRITABLE already? That's a bit involved, this is how I see it: There are four buffer *request* flags that can be sent to a buffer provider and that indicate the amount of complexity that a consumer can handle (in decreasing order): PyBUF_INDIRECT -> suboffsets (PIL-style) PyBUF_STRIDES -> strides (Numpy-style) PyBUF_ND -> C-contiguous, but possibly multi-dimensional PyBUF_SIMPLE -> contiguous, one-dimensional, unsigned bytes Each of those flags can be mixed freely with two additional flags: PyBUF_WRITABLE PyBUF_FORMAT All other buffer request flags are simply combinations of those. For example, if you use PyBUF_WRITABLE as the only flag, logically it should be seen as PyBUF_WRITABLE|PyBUF_SIMPLE (this works since PyBUF_SIMPLE is defined as 0). PyBUF_READ and PyBUF_WRITE are so far only used for PyMemoryView_GetContiguous(). The PEP still has a flag named PyBUF_UPDATEIFCOPY, but that didn't make it into object.h. 
I thought it might be appropriate to use PyBUF_READ and PyBUF_WRITE to underline the fact that you cannot send a fine grained buffer request to PyMemoryView_FromMemory()[1]. Also, PyBUF_READ is easier to understand than PyBUF_SIMPLE. But I'd be equally happy with PyBUF_SIMPLE/PyBUF_WRITABLE. Stefan Krah [1] The terminology might sound funny, but there is a function that can act a micro buffer provider: int PyBuffer_FillInfo(Py_buffer *view, PyObject *obj, void *buf, Py_ssize_t len, int readonly, int infoflags) An exporter can use this function as a building block for a getbuffer() method for unsigned bytes, since it reacts correctly to *all* possible buffer requests in 'infoflags'. From ncoghlan at gmail.com Fri Aug 19 03:30:48 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 19 Aug 2011 11:30:48 +1000 Subject: [Python-Dev] cpython (3.2): NUL -> NULL In-Reply-To: <20110818185121.GA19783@sleipnir.bytereef.org> References: <20110818201916.35218b7e@pitrou.net> <20110818185121.GA19783@sleipnir.bytereef.org> Message-ID: On Fri, Aug 19, 2011 at 4:51 AM, Stefan Krah wrote: > So I think it should be either NUL or "null character" with the lower > case spelling. +1 Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From p.f.moore at gmail.com Fri Aug 19 17:35:29 2011 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 19 Aug 2011 16:35:29 +0100 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: References: Message-ID: On 15 August 2011 11:31, Tarek Ziad? wrote: > IOW, the task to do is: > > 1/ copy packaging and all its stdlib dependencies in a standalone project > 2/ rename packaging to distutils2 > 3/ make it work under older 2.x and 3.x (2.x would be the priority) ?<==== > 4/ release it, promote its usage > 5/ consolidate the API with the feedback received One thing that I, as a semi-interested bystander, would like to see is sort of a component of 4. Namely, a document somewhere addressing the question of why I, as a current user of distutils (setup.py, etc), should convert my project to use packaging/distutils2 - and what I'd need to do so. At the moment, I see no benefit to me in migrating. New projects, or projects that already know that they want one or more of the benefits that packaging/distutils2/setuptools bring, are a different matter. It's projects with needs satisfied by distutils, and code invested a distutils-based solution, that could do with some persuasion. I checked the docs, and "Distributing Python Modules" is for new projects, and "What's new" basically says "we expect you to migrate" but has no reasons or guidelines. If someone borrows the time machine and makes this already available, so much the better. Pointers would be appreciated! Paul. From merwok at netwok.org Fri Aug 19 17:40:28 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Fri, 19 Aug 2011 17:40:28 +0200 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: References: Message-ID: <4E4E83EC.8040708@netwok.org> > One thing that I, as a semi-interested bystander, would like to see is > sort of a component of 4. Namely, a document somewhere addressing the > question of why I, as a current user of distutils (setup.py, etc), > should convert my project to use packaging/distutils2 - and what I'd > need to do so. I?m working on such a document, first in a doc set outside of the Python docs, and when it?s ready as an official HOWTO. (I?ll send the URL when I finish and publish it.) 
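As a rough illustration of the kind of before/after such a guide would show (the setup.cfg field names below are written from memory and should be checked against the packaging documentation), a minimal distutils setup.py like:

    from distutils.core import setup

    setup(name='example', version='0.1', packages=['example'])

would become a purely declarative setup.cfg along these lines:

    [metadata]
    name = example
    version = 0.1

    [files]
    packages = example

so for the common case there is no code to port, only options to move.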
> I checked the docs, and "Distributing Python Modules" is for new > projects, That doc set is for distutils, unless you meant ?Distributing Python Projects?, which is currently under severe updating. Regards From p.f.moore at gmail.com Fri Aug 19 17:56:37 2011 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 19 Aug 2011 16:56:37 +0100 Subject: [Python-Dev] Packaging in Python 2 anyone ? In-Reply-To: <4E4E83EC.8040708@netwok.org> References: <4E4E83EC.8040708@netwok.org> Message-ID: On 19 August 2011 16:40, ?ric Araujo wrote: >> One thing that I, as a semi-interested bystander, would like to see is >> sort of a component of 4. Namely, a document somewhere addressing the >> question of why I, as a current user of distutils (setup.py, etc), >> should convert my project to use packaging/distutils2 - and what I'd >> need to do so. > > I?m working on such a document, first in a doc set outside of the Python > docs, and when it?s ready as an official HOWTO. ?(I?ll send the URL when > I finish and publish it.) Nice :-) I'll try to provide some feedback when it's ready. >> I checked the docs, and "Distributing Python Modules" is for new >> projects, > > That doc set is for distutils, unless you meant ?Distributing Python > Projects?, which is currently under severe updating. Sorry, I did indeed mean "... Projects" - I had looked at the Python 3.3 doc tree, and hadn't noticed that the name had changed. Paul. From status at bugs.python.org Fri Aug 19 18:07:22 2011 From: status at bugs.python.org (Python tracker) Date: Fri, 19 Aug 2011 18:07:22 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20110819160722.C3BE21CA8D@psf.upfronthosting.co.za> ACTIVITY SUMMARY (2011-08-12 - 2011-08-19) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue. Do NOT respond to this message. 
Issues counts and deltas: open 2937 (+14) closed 21630 (+28) total 24567 (+42) Open issues with patches: 1266 Issues opened (31) ================== #12409: Moving "Documenting Python" to Devguide http://bugs.python.org/issue12409 reopened by eric.araujo #12745: Python2 or Python3 page http://bugs.python.org/issue12745 opened by JBernardo #12746: normalization is affected by unicode width http://bugs.python.org/issue12746 opened by benjamin.peterson #12749: lib re cannot match non-BMP ranges (all versions, all builds) http://bugs.python.org/issue12749 opened by tchrist #12750: datetime.strftime('%s') should respect tzinfo http://bugs.python.org/issue12750 opened by Daniel.O'Connor #12753: \N{...} neglects formal aliases and named sequences from Unico http://bugs.python.org/issue12753 opened by tchrist #12754: Add alternative random number generators http://bugs.python.org/issue12754 opened by rhettinger #12757: undefined name in doctest.py http://bugs.python.org/issue12757 opened by georg.brandl #12758: time.time() returns local time instead of UTC http://bugs.python.org/issue12758 opened by maksbotan #12759: "(?P=)" input for Tools/scripts/redemo.py throw an exception http://bugs.python.org/issue12759 opened by fredeom #12760: Add create mode to open() http://bugs.python.org/issue12760 opened by David.Townshend #12761: Typo in Doc/license.rst http://bugs.python.org/issue12761 opened by jwilk #12762: EnvironmentError_str contributes to unportable code http://bugs.python.org/issue12762 opened by jwilk #12764: segfault in ctypes.Struct with bad _fields_ http://bugs.python.org/issue12764 opened by amaury.forgeotdarc #12765: test_packaging failure under Snow Leopard http://bugs.python.org/issue12765 opened by pitrou #12767: document threading.Condition.notify http://bugs.python.org/issue12767 opened by eli.bendersky #12768: docstrings for the threading module http://bugs.python.org/issue12768 opened by eli.bendersky #12769: String with NUL characters truncated by ctypes when assigning http://bugs.python.org/issue12769 opened by Rafal.Dowgird #12771: 2to3 -d adds extra whitespace http://bugs.python.org/issue12771 opened by VPeric #12772: fractional day attribute in datetime class http://bugs.python.org/issue12772 opened by Miguel.de.Val.Borro #12774: Warning -- multiprocessing.process._dangling was modified by t http://bugs.python.org/issue12774 opened by ned.deily #12775: immense performance problems related to the garbage collector http://bugs.python.org/issue12775 opened by dsvensson #12776: argparse: type conversion function should be called only once http://bugs.python.org/issue12776 opened by arnau #12777: Inconsistent use of VOLUME_NAME_* with GetFinalPathNameByHandl http://bugs.python.org/issue12777 opened by pitrou #12778: JSON-serializing a large container takes too much memory http://bugs.python.org/issue12778 opened by pitrou #12779: Update packaging documentation http://bugs.python.org/issue12779 opened by eric.araujo #12780: Clean up tests for pyc/pyo in __file__ http://bugs.python.org/issue12780 opened by eric.araujo #12781: Mention SO_REUSEADDR near socket doc examples http://bugs.python.org/issue12781 opened by sandro.tosi #12782: Multiple context expressions do not support parentheses for co http://bugs.python.org/issue12782 opened by Julian #12783: test_posix failure on FreeBSD 6.4: test_get_and_set_scheduler_ http://bugs.python.org/issue12783 opened by neologix #12785: list_distinfo_file is wrong http://bugs.python.org/issue12785 opened by eric.araujo Most recent 15 issues 
with no replies (15) ========================================== #12785: list_distinfo_file is wrong http://bugs.python.org/issue12785 #12783: test_posix failure on FreeBSD 6.4: test_get_and_set_scheduler_ http://bugs.python.org/issue12783 #12782: Multiple context expressions do not support parentheses for co http://bugs.python.org/issue12782 #12776: argparse: type conversion function should be called only once http://bugs.python.org/issue12776 #12772: fractional day attribute in datetime class http://bugs.python.org/issue12772 #12771: 2to3 -d adds extra whitespace http://bugs.python.org/issue12771 #12768: docstrings for the threading module http://bugs.python.org/issue12768 #12759: "(?P=)" input for Tools/scripts/redemo.py throw an exception http://bugs.python.org/issue12759 #12742: Add support for CESU-8 encoding http://bugs.python.org/issue12742 #12739: read stuck with multithreading and simultaneous subprocess.Pop http://bugs.python.org/issue12739 #12736: Request for python casemapping functions to use full not simpl http://bugs.python.org/issue12736 #12735: request full Unicode collation support in std python library http://bugs.python.org/issue12735 #12706: timeout sentinel in ftplib and poplib documentation http://bugs.python.org/issue12706 #12684: profile does not dump stats on exception like cProfile does http://bugs.python.org/issue12684 #12668: 3.2 What's New: it's integer->string, not the opposite http://bugs.python.org/issue12668 Most recent 15 issues waiting for review (15) ============================================= #12785: list_distinfo_file is wrong http://bugs.python.org/issue12785 #12781: Mention SO_REUSEADDR near socket doc examples http://bugs.python.org/issue12781 #12780: Clean up tests for pyc/pyo in __file__ http://bugs.python.org/issue12780 #12778: JSON-serializing a large container takes too much memory http://bugs.python.org/issue12778 #12776: argparse: type conversion function should be called only once http://bugs.python.org/issue12776 #12764: segfault in ctypes.Struct with bad _fields_ http://bugs.python.org/issue12764 #12761: Typo in Doc/license.rst http://bugs.python.org/issue12761 #12760: Add create mode to open() http://bugs.python.org/issue12760 #12740: Add struct.Struct.nmemb http://bugs.python.org/issue12740 #12723: Provide an API in tkSimpleDialog for defining custom validatio http://bugs.python.org/issue12723 #12720: Expose linux extended filesystem attributes http://bugs.python.org/issue12720 #12708: multiprocessing.Pool is missing a starmap[_async]() method. 
http://bugs.python.org/issue12708 #12691: tokenize.untokenize is broken http://bugs.python.org/issue12691 #12684: profile does not dump stats on exception like cProfile does http://bugs.python.org/issue12684 #12668: 3.2 What's New: it's integer->string, not the opposite http://bugs.python.org/issue12668 Top 10 most discussed issues (10) ================================= #12326: Linux 3: code should avoid using sys.platform == 'linux2' http://bugs.python.org/issue12326 53 msgs #10542: Py_UNICODE_NEXT and other macros for surrogates http://bugs.python.org/issue10542 33 msgs #12729: Python lib re cannot handle Unicode properly due to narrow/wid http://bugs.python.org/issue12729 32 msgs #12760: Add create mode to open() http://bugs.python.org/issue12760 13 msgs #12740: Add struct.Struct.nmemb http://bugs.python.org/issue12740 12 msgs #12775: immense performance problems related to the garbage collector http://bugs.python.org/issue12775 12 msgs #12749: lib re cannot match non-BMP ranges (all versions, all builds) http://bugs.python.org/issue12749 11 msgs #12394: packaging: generate scripts from callable (dotted paths) http://bugs.python.org/issue12394 10 msgs #12750: datetime.strftime('%s') should respect tzinfo http://bugs.python.org/issue12750 9 msgs #8668: Packaging: add a 'develop' command http://bugs.python.org/issue8668 8 msgs Issues closed (25) ================== #8617: Better document user site-packages in site module doc http://bugs.python.org/issue8617 closed by eric.araujo #9173: logger statement not guarded in shutil._make_tarball http://bugs.python.org/issue9173 closed by eric.araujo #10745: setup.py install --user option undocumented http://bugs.python.org/issue10745 closed by eric.araujo #12204: str.upper converts to title http://bugs.python.org/issue12204 closed by ezio.melotti #12256: Link isinstance/issubclass doc to abc module http://bugs.python.org/issue12256 closed by eric.araujo #12646: zlib.Decompress.decompress/flush do not raise any exceptions w http://bugs.python.org/issue12646 closed by nadeem.vawda #12650: Subprocess leaks fd upon kill() http://bugs.python.org/issue12650 closed by neologix #12672: Some problems in documentation extending/newtypes.html http://bugs.python.org/issue12672 closed by eli.bendersky #12711: Explain tracker components in devguide http://bugs.python.org/issue12711 closed by ezio.melotti #12721: Chaotic use of helper functions in test_shutil for reading and http://bugs.python.org/issue12721 closed by eric.araujo #12725: Docs: Odd phrase "floating seconds" in socket.html http://bugs.python.org/issue12725 closed by ezio.melotti #12730: Python's casemapping functions are incorrect for non-BMP chars http://bugs.python.org/issue12730 closed by ezio.melotti #12732: Can't portably use Unicode in Python identifiers http://bugs.python.org/issue12732 closed by python-dev #12744: inefficient pickling of long integers on 64-bit builds http://bugs.python.org/issue12744 closed by pitrou #12747: Move devguide into cpython repo http://bugs.python.org/issue12747 closed by eric.snow #12748: Problems using IDLE accelerators with OS X Dvorak - Qwerty ??? 
http://bugs.python.org/issue12748 closed by ned.deily #12751: Use macros for surrogates in unicodeobject.c http://bugs.python.org/issue12751 closed by benjamin.peterson #12752: locale.normalize does not take unicode strings http://bugs.python.org/issue12752 closed by barry #12755: Service application crash in python25!PyObject_Malloc http://bugs.python.org/issue12755 closed by haypo #12756: datetime.datetime.utcnow should return a UTC timestamp http://bugs.python.org/issue12756 closed by brett.cannon #12763: test_posix failure on OpenIndiana http://bugs.python.org/issue12763 closed by python-dev #12766: strange interaction between __slots__ and class-level attribut http://bugs.python.org/issue12766 closed by python-dev #12770: Email problem on Windows XP SP3 32bits http://bugs.python.org/issue12770 closed by brian.curtin #12773: classes should have mutable docstrings http://bugs.python.org/issue12773 closed by python-dev #12784: Concatenation of strings returns the wrong string http://bugs.python.org/issue12784 closed by haypo From cs at zip.com.au Sat Aug 20 03:14:19 2011 From: cs at zip.com.au (Cameron Simpson) Date: Sat, 20 Aug 2011 11:14:19 +1000 Subject: [Python-Dev] cpython (3.2): NUL -> NULL In-Reply-To: <20110818185121.GA19783@sleipnir.bytereef.org> References: <20110818185121.GA19783@sleipnir.bytereef.org> Message-ID: <20110820011419.GA8291@cskk.homeip.net> On 18Aug2011 20:51, Stefan Krah wrote: | Antoine Pitrou wrote: | > On Thu, 18 Aug 2011 17:49:28 +0200 | > benjamin.peterson wrote: | > > - PyErr_SetString(PyExc_TypeError, "embedded NUL character"); | > > + PyErr_SetString(PyExc_TypeError, "embedded NULL character"); | > | > Are you sure? IIRC, NUL is the little name of ASCII character 0 | > (while NULL would be the NULL pointer). | | Yes, that's the traditional name. I was surprised that the C99 standard uses | "null character" in almost all cases. Example: | | "The construction '\0' is commonly used to represent the null character." | | So I think it should be either NUL or "null character" with the lower | case spelling. +1 from me, too. -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ I like to keep an open mind, but not so open my brains fall out. - New York Times Chairman Arthur Sulzberger From facundobatista at gmail.com Sat Aug 20 12:58:13 2011 From: facundobatista at gmail.com (Facundo Batista) Date: Sat, 20 Aug 2011 07:58:13 -0300 Subject: [Python-Dev] Strange message error in socket.sendto() exception Message-ID: Python 3.2 (r32:88445, Mar 25 2011, 19:28:28) [GCC 4.5.2] on linux2 >>> import socket >>> s = socket.socket() >>> print(s.sendto.__doc__) sendto(data[, flags], address) -> count ... >>> s.sendto(b'data', ('localhost', 3)) Traceback (most recent call last): File "", line 1, in socket.error: [Errno 32] Broken pipe This is ok, I expected this. However, note what happens if I send unicode: >>> s.sendto('data', ('localhost', 3)) Traceback (most recent call last): File "", line 1, in TypeError: sendto() takes exactly 3 arguments (2 given) An error regarding the argument quantity? what? Furthermore, where this message comes from? I tried to find, but the only hint I get is that it could come from "./Modules/_ctypes/_ctypes.c"... are we using ctypes to access socket methods? it's strange, because "sendto" is defined in socketmodule.c. Ideas? Thanks! -- .? ? 
Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/ From solipsis at pitrou.net Sat Aug 20 13:08:11 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 20 Aug 2011 13:08:11 +0200 Subject: [Python-Dev] Strange message error in socket.sendto() exception References: Message-ID: <20110820130811.76f9118b@pitrou.net> On Sat, 20 Aug 2011 07:58:13 -0300 Facundo Batista wrote: > > This is ok, I expected this. However, note what happens if I send unicode: > > >>> s.sendto('data', ('localhost', 3)) > Traceback (most recent call last): > File "", line 1, in > TypeError: sendto() takes exactly 3 arguments (2 given) > > An error regarding the argument quantity? what? Here I get (3.2.2, 3.3): >>> s.sendto('data', ('localhost', 3)) Traceback (most recent call last): File "", line 1, in TypeError: 'str' does not support the buffer interface From vinay_sajip at yahoo.co.uk Sat Aug 20 14:00:55 2011 From: vinay_sajip at yahoo.co.uk (Vinay Sajip) Date: Sat, 20 Aug 2011 12:00:55 +0000 (UTC) Subject: [Python-Dev] Strange message error in socket.sendto() exception References: Message-ID: Facundo Batista gmail.com> writes: > An error regarding the argument quantity? what? > > Ideas? Thanks! > I think this is the same as http://bugs.python.org/issue5421 tl;dr : fixed in recent versions. Regards, Vinay Sajip From p.f.moore at gmail.com Sat Aug 20 22:52:03 2011 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 20 Aug 2011 21:52:03 +0100 Subject: [Python-Dev] Buildbot failures Message-ID: My buildbot seems to have been failing for a while (I've been away on holiday) - http://www.python.org/dev/buildbot/buildslaves/moore-windows The failures seem to generally be in distutils and/or packaging. I see quite a lot of reds in the waterfall display at the moment, and I can't see any particular issue with my buildbot, so before I go digging further, can anyone confirm (or otherwise) if distutils/packaging is currently generating known failures (and hence, the alerts can be ignored? (I'd only be looking for environment-related problems, I'm afraid - I don't have any distutils/packaging expertise to bring to bear on genuine code issues...) Thanks, Paul. From merwok at netwok.org Sun Aug 21 10:17:25 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Sun, 21 Aug 2011 10:17:25 +0200 Subject: [Python-Dev] [Python-checkins] cpython (3.2): #5301: add image/vnd.microsoft.icon (.ico) MIME type In-Reply-To: References: Message-ID: <4E50BF15.8020502@netwok.org> Hi, However small the commit was, I think it still was a feature request, so I wonder if it was appropriate for the stable versions. Regards From merwok at netwok.org Sun Aug 21 10:11:32 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Sun, 21 Aug 2011 10:11:32 +0200 Subject: [Python-Dev] Buildbot failures In-Reply-To: References: Message-ID: <4E50BDB4.6060809@netwok.org> Le 20/08/2011 22:52, Paul Moore a ?crit : > My buildbot seems to have been failing for a while (I've been away on > holiday) - http://www.python.org/dev/buildbot/buildslaves/moore-windows > > The failures seem to generally be in distutils and/or packaging. I see > quite a lot of reds in the waterfall display at the moment, and I > can't see any particular issue with my buildbot, so before I go > digging further, can anyone confirm (or otherwise) if > distutils/packaging is currently generating known failures (and hence, > the alerts can be ignored? 
Yes: http://bugs.python.org/issue12678 Regards From sandro.tosi at gmail.com Sun Aug 21 11:09:35 2011 From: sandro.tosi at gmail.com (Sandro Tosi) Date: Sun, 21 Aug 2011 11:09:35 +0200 Subject: [Python-Dev] [Python-checkins] cpython (3.2): #5301: add image/vnd.microsoft.icon (.ico) MIME type In-Reply-To: <4E50BF15.8020502@netwok.org> References: <4E50BF15.8020502@netwok.org> Message-ID: Hi, On Sun, Aug 21, 2011 at 10:17, ?ric Araujo wrote: > Hi, > > However small the commit was, I think it still was a feature request, so > I wonder if it was appropriate for the stable versions. I can see your point: the reason I committed it also on the stable branches is that .ico are already out there (since a long time) and they were currently not recognized. I can call it a bug. Anyhow, if it was not appropriate, just tell me and I'll revert on 2.7 and 3.2 . Thanks for your input! Cheers, -- Sandro Tosi (aka morph, morpheus, matrixhasu) My website: http://matrixhasu.altervista.org/ Me at Debian: http://wiki.debian.org/SandroTosi From tjreedy at udel.edu Sun Aug 21 21:12:24 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Sun, 21 Aug 2011 15:12:24 -0400 Subject: [Python-Dev] [Python-checkins] cpython (3.2): #5301: add image/vnd.microsoft.icon (.ico) MIME type In-Reply-To: References: <4E50BF15.8020502@netwok.org> Message-ID: On 8/21/2011 5:09 AM, Sandro Tosi wrote: >> However small the commit was, I think it still was a feature request, so >> I wonder if it was appropriate for the stable versions. Good catch. > I can see your point: the reason I committed it also on the stable > branches is that .ico are already out there (since a long time) and > they were currently not recognized. I can call it a bug. But it is not (a behavior bug). Every feature request 'fixes' what its proposer considers to be a design bug or something. > Anyhow, if it was not appropriate, just tell me and I'll revert on 2.7 > and 3.2 . Thanks for your input! It is a new feature for the same reason http://bugs.python.org/issue10730 was. If that had not been added for 3.2.0 (during the beta period, with Georg's permission), it would have waited for 3.3.s Our intent is that the initial CPython x.y.0 release 'freeze' the definition of Python x.y. Code that were to use the new feature in 3.2.3 would not work in 3.2.0,.1,.2, making 3.2.3 define a slight variant. People who want the latest version of an stdlib module should upgrade to the latest release or even download from the repository. For mimetypes, the database can be explicitly augmented in the code and then the code would work in all 2.7 or 3.2 releases. -- Terry Jan Reedy From scott+python-dev at scottdial.com Mon Aug 22 02:10:47 2011 From: scott+python-dev at scottdial.com (Scott Dial) Date: Sun, 21 Aug 2011 20:10:47 -0400 Subject: [Python-Dev] [Python-checkins] cpython (3.2): #5301: add image/vnd.microsoft.icon (.ico) MIME type In-Reply-To: References: <4E50BF15.8020502@netwok.org> Message-ID: <4E519E87.3000205@scottdial.com> On 8/21/2011 3:12 PM, Terry Reedy wrote: > On 8/21/2011 5:09 AM, Sandro Tosi wrote: >> I can see your point: the reason I committed it also on the stable >> branches is that .ico are already out there (since a long time) and >> they were currently not recognized. I can call it a bug. > > But it is not (a behavior bug). Every feature request 'fixes' what its > proposer considers to be a design bug or something. What's the feature added? That's a semantic game. > >> Anyhow, if it was not appropriate, just tell me and I'll revert on 2.7 >> and 3.2 . 
Thanks for your input! > > It is a new feature for the same reason > http://bugs.python.org/issue10730 was. If that had not been added for > 3.2.0 (during the beta period, with Georg's permission), it would have > waited for 3.3.s ISTM, that Issue #10730 was more contentious because it is *not* an IANA-assigned mime-type, whereas image/vnd.microsoft.icon is and has been since 2003. Whereas image/svg+xml didn't get approved until earlier this month, AFAICT. > Our intent is that the initial CPython x.y.0 release 'freeze' the > definition of Python x.y. Code that were to use the new feature in 3.2.3 > would not work in 3.2.0,.1,.2, making 3.2.3 define a slight variant. > People who want the latest version of an stdlib module should upgrade to > the latest release or even download from the repository. For mimetypes, > the database can be explicitly augmented in the code and then the code > would work in all 2.7 or 3.2 releases. Doesn't that weaken your own argument that changing the list in Lib/mimetypes.py doesn't violate the freeze? Considering that the mime-types are automatically read from a variety of out-of-tree locations? It's already the case that the list of mime-types recognized by a given CPython x.y.z is inconsistent from platform-to-platform and more importantly installation-to-installation (since /etc/mime.types could be customized by a given distribution or modified by a local administrator, and on Windows, the mime-types are scrapped from the registry). On any reasonable system that I can get access to at the moment (Gentoo, OS X, Win7), '.ico' is already associated with 'image/x-icon' via either scrapping the /etc/mime.types or the registry. I think this issue probably originates with CPython 2.6 on Windows, where there was no help from the registry or external mime.types file. Nevertheless, I am +0 for adding entries from the IANA list into stable versions because I don't see how they could ever harm anyone. Any robust program would need to be responsible and populate the mimetypes itself, if it depended on them, otherwise, all bets are off about what types_map contains from run-to-run of a program (because /etc/mime.types might have changed). -- Scott Dial scott at scottdial.com From stephen at xemacs.org Mon Aug 22 03:51:42 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Mon, 22 Aug 2011 10:51:42 +0900 Subject: [Python-Dev] [Python-checkins] cpython (3.2): #5301: add image/vnd.microsoft.icon (.ico) MIME type In-Reply-To: <4E519E87.3000205@scottdial.com> References: <4E50BF15.8020502@netwok.org> <4E519E87.3000205@scottdial.com> Message-ID: <871uwekyc1.fsf@uwakimon.sk.tsukuba.ac.jp> Scott Dial writes: > On 8/21/2011 3:12 PM, Terry Reedy wrote: > > On 8/21/2011 5:09 AM, Sandro Tosi wrote: > >> I can see your point: the reason I committed it also on the stable > >> branches is that .ico are already out there (since a long time) and > >> they were currently not recognized. I can call it a bug. > > > > But it is not (a behavior bug). Every feature request 'fixes' what its > > proposer considers to be a design bug or something. > > What's the feature added? That's a semantic game. There's really only one way to fairly objectively resolve this: "Behavior that varies from documented behavior is a bug." Everything else is a feature request, including requests for addition of as-yet undocumented behavior that is quite exactly analogous to existing behavior. Of course you can also play games with the definition of "documentation". 
If the BDFL says that his Original Intent was that behavior X be supported, I suppose that's Sufficiently Well-Documented (and due to the time machine Always Has Been). Or there may be a blanket statement that "we will conform to the version of external standard Y that is current / draft / whatever when x.y.0 is released," made by the maintainer of the module on python-dev in 1999. What does the documentation say? On a separate issue: > ISTM, that Issue #10730 was more contentious because it is *not* an > IANA-assigned mime-type, whereas image/vnd.microsoft.icon is and has > been since 2003. Is it? Maybe Microsoft has cleaned up their act, but my experience with their IANA assignments is that there's no reliable behavior documented by them -- the registration documents point at internal Microsoft documents that change over time. For example, they added the EURO SYMBOL to several registered MIME charsets without updating the IANA registrations. I don't consider a registration that points to a internal corporate document with variable content to be a suitable specification for open source implementation, even if the IANA can be brib^H^H^H^Hfooled into accepting a registration. > Nevertheless, I am +0 for adding entries from the IANA list into stable > versions because I don't see how they could ever harm anyone. Features that you can't see how they could ever harm anyone are the cracker's favorite back door. Entries in the IANA list enable arbitrarily complex behavior. > Any robust program would need to be responsible and populate the > mimetypes itself, if it depended on them, otherwise, all bets are > off about what types_map contains from run-to-run of a program > (because /etc/mime.types might have changed). That's precisely why Python should not change this, flipped around. A site that carefully controls what's in mime.types should not have to worry about Python changing types_map behind its back in a patch release. The right thing to do is to provide a module that allows the user to request update of the databases automatically, and document how to do it by hand for users who are differently abled net-wise. From tjreedy at udel.edu Mon Aug 22 04:01:40 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Sun, 21 Aug 2011 22:01:40 -0400 Subject: [Python-Dev] [Python-checkins] cpython (3.2): #5301: add image/vnd.microsoft.icon (.ico) MIME type In-Reply-To: <4E519E87.3000205@scottdial.com> References: <4E50BF15.8020502@netwok.org> <4E519E87.3000205@scottdial.com> Message-ID: On 8/21/2011 8:10 PM, Scott Dial wrote: > On 8/21/2011 3:12 PM, Terry Reedy wrote: >> But it is not (a behavior bug). Every feature request 'fixes' what its >> proposer considers to be a design bug or something. > > What's the feature added? That's a semantic game. Please. It is an operational decision. I personally would be ok with doing away with bugfix-only releases and just releasing a new version with all patches every 6 months. It certainly would make issue management easier. But most people don't want such rapid change, even to the point of resisting fixes to design errors of 20 years ago. On the other hand, most people want their personal fix/feature included right away, in the next release. But if we do not include everything every release, we make decisions as what to include or not. >> It is a new feature for the same reason >> http://bugs.python.org/issue10730 was. 
If that had not been added for >> 3.2.0 (during the beta period, with Georg's permission), it would have >> waited for 3.3.s In http://bugs.python.org/msg124332 from that issue, David Murray refers to "the policy stated in mimetypes". I could not find a policy explicitly stated in the doc, not in a quick review of the code. But I believe what he meant is "Include the most commonly used subset of registered extensions. Add more as requested with every x.y version." If it is really not in the doc, I wish it, or an agreed-on revision, were added. "Add more as requested with every x.y.x release." is the alternative that Sandro seems to have followed. > ISTM, that Issue #10730 was more contentious because it is *not* an > IANA-assigned mime-type, whereas image/vnd.microsoft.icon is and has > been since 2003. Whereas image/svg+xml didn't get approved until earlier > this month, AFAICT. If we intended to include all registered mimetypes and this happened to be missing, that would be a bug. But there are scads of mimetypes, especially vender-specific vnd types, that we do not include. Many predate 2003 and are probably obsolete, and hence well not included. There might be others that are used generally. -- Terry Jan Reedy From barry at python.org Mon Aug 22 17:44:29 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 22 Aug 2011 11:44:29 -0400 Subject: [Python-Dev] Call for participants: Windows Python security experts Message-ID: <20110822114429.4491b9f9@resist.wooz.org> Hi folks, The Python security team is a small group of core Python developers who triage and respond to vulnerability reports sent to security at python.org. We get all kinds of reports, for which we try to provide guidance and feedback, review patches, etc. Python being as secure as it is, traffic is fairly low. :) We have a dearth of Windows expertise on the team though, so I am putting out a call for participants. If you are an expert on Python for Windows operating systems and can make judgments about the validity of security reports for the platform, please contact us. Core developers are preferred, but motivation and available time is paramount. You're welcome to apply even if you're not a Windows expert, if you have the time and ability to help out in general. If you're interested, you can reach the team at security at python.org. Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From torsten.becker at gmail.com Mon Aug 22 20:58:51 2011 From: torsten.becker at gmail.com (Torsten Becker) Date: Mon, 22 Aug 2011 14:58:51 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project Message-ID: Hello all, I have implemented an initial version of PEP 393 -- "Flexible String Representation" as part of my Google Summer of Code project. My patch is hosted as a repository on bitbucket [1] and I created a related issue on the bug tracker [2]. I posted documentation for the current state of the development in the wiki [3]. Current tests show a potential reduction of memory by about 20% and CPU by 50% for a join micro benchmark. Starting a new interpreter still causes 3244 calls to create compatibility Py_UNICODE representations, 263 strings are created using the old API while 62719 are created using the new API. More measurements are on the wiki page [3]. If there is interest, I would like to continue working on the patch with the goal of getting it into Python 3.3. Any and all feedback is welcome. 
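For a first impression of how extension code is meant to use the new API, here is a rough sketch of a loop written against the accessor macros described in the PEP (PyUnicode_KIND / PyUnicode_DATA / PyUnicode_READ); it is illustrative only and may not match the patch exactly:

    /* Count the spaces in a string through the PEP 393 accessor macros.
       Sketch based on the PEP text; details may differ in the patch. */
    static Py_ssize_t
    count_spaces(PyObject *u)
    {
        int kind = PyUnicode_KIND(u);          /* 1-, 2- or 4-byte form */
        void *data = PyUnicode_DATA(u);        /* canonical buffer */
        Py_ssize_t n = PyUnicode_GET_LENGTH(u);
        Py_ssize_t i, count = 0;

        for (i = 0; i < n; i++) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, i);
            if (Py_UNICODE_ISSPACE(ch))
                count++;
        }
        return count;
    }

The same source works for all three representations; the macros pick the right element width from the kind.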
Regards, Torsten [1]: http://www.python.org/dev/peps/pep-0393 [2]: http://bugs.python.org/issue12819 [3]: http://wiki.python.org/moin/SummerOfCode/2011/PEP393 From v+python at g.nevcal.com Mon Aug 22 22:24:44 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Mon, 22 Aug 2011 13:24:44 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: Message-ID: <4E52BB0C.3010208@g.nevcal.com> On 8/22/2011 11:58 AM, Torsten Becker wrote: > Hello all, > > I have implemented an initial version of PEP 393 -- "Flexible String > Representation" as part of my Google Summer of Code project. My patch > is hosted as a repository on bitbucket [1] and I created a related > issue on the bug tracker [2]. I posted documentation for the current > state of the development in the wiki [3]. > > Current tests show a potential reduction of memory by about 20% and > CPU by 50% for a join micro benchmark. Starting a new interpreter > still causes 3244 calls to create compatibility Py_UNICODE > representations, 263 strings are created using the old API while 62719 > are created using the new API. More measurements are on the wiki page > [3]. > > If there is interest, I would like to continue working on the patch > with the goal of getting it into Python 3.3. Any and all feedback is > welcome. Sounds like great progress! -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Tue Aug 23 00:14:40 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 23 Aug 2011 00:14:40 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project References: Message-ID: <20110823001440.433a0f1f@pitrou.net> Hello, On Mon, 22 Aug 2011 14:58:51 -0400 Torsten Becker wrote: > > I have implemented an initial version of PEP 393 -- "Flexible String > Representation" as part of my Google Summer of Code project. My patch > is hosted as a repository on bitbucket [1] and I created a related > issue on the bug tracker [2]. I posted documentation for the current > state of the development in the wiki [3]. A couple of minor comments: - ?The UTF-8 decoding fast path for ASCII only characters was removed and replaced with a memcpy if the entire string is ASCII.? The fast path would still be useful for mostly-ASCII strings, which are extremely common (unless UTF-8 has become a no-op?). - You could trim the debug results from the benchmark results, this may make them more readable. - You could try to run stringbench, which can be found at http://svn.python.org/projects/sandbox/trunk/stringbench (*) and there's iobench (the text mode benchmarks) in the Tools/iobench directory. (*) (yes, apparently we forgot to convert this one to Mercurial) Regards Antoine. From solipsis at pitrou.net Tue Aug 23 00:15:43 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 23 Aug 2011 00:15:43 +0200 Subject: [Python-Dev] cpython (3.2): #9200: The str.is* methods now work with strings that contain non-BMP References: Message-ID: <20110823001543.7064e8e3@pitrou.net> On Mon, 22 Aug 2011 19:31:32 +0200 ezio.melotti wrote: > http://hg.python.org/cpython/rev/06b30c5bcc3d > changeset: 72035:06b30c5bcc3d > branch: 3.2 > parent: 72026:c8e73a89150e > user: Ezio Melotti > date: Mon Aug 22 14:08:38 2011 +0300 > summary: > #9200: The str.is* methods now work with strings that contain non-BMP characters even in narrow Unicode builds. That's a very cool improvement! cheers Antoine. 
From sandro.tosi at gmail.com Tue Aug 23 01:09:28 2011 From: sandro.tosi at gmail.com (Sandro Tosi) Date: Tue, 23 Aug 2011 01:09:28 +0200 Subject: [Python-Dev] Sphinx version for Python 2.x docs In-Reply-To: <4E4AF610.5040303@simplistix.co.uk> References: <4E4AF610.5040303@simplistix.co.uk> Message-ID: Hi all, > Any chance the version of sphinx used to generate the docs on > docs.python.org could be updated? I'd like to discuss this aspect, in particular for the implication it has on http://bugs.python.org/issue12409 . Personally, I do think it has a value to have the same set of tools to build the Python documentation of the currently active branches. Currently, only 2.7 is different, since it still fetches (from svn.python.org... can we fix this too? suggestions welcome!) sphinx 0.6.7 while 3.2/3.3 uses 1.0.7. If you're worried about the time needed to convert the actual 2.7 doc to new sphinx format and all the related changes, I volunteer to do the job (and/or collaborate with whom is already on it), but what I want to understand if it's an acceptable change. I see sphinx more as of an internal, building tool, so freezing it it's like saying "don't upgrade gcc" or so. Now the delta is just the C functions definitions and some py-specific roles, but during the years it will increase. Keeping it small, simplifying the forward-port of doc patches (not needing to have 2 version between 2.7 and 3.x f.e.) and having a common set of tools for doc building is worth IMHO. What do you think about it? and yes Georg, I'd like to hear your opinion too :) Cheers, -- Sandro Tosi (aka morph, morpheus, matrixhasu) My website: http://matrixhasu.altervista.org/ Me at Debian: http://wiki.debian.org/SandroTosi From stefan_ml at behnel.de Tue Aug 23 09:02:54 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 23 Aug 2011 09:02:54 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: Message-ID: Torsten Becker, 22.08.2011 20:58: > I have implemented an initial version of PEP 393 -- "Flexible String > Representation" as part of my Google Summer of Code project. My patch > is hosted as a repository on bitbucket [1] and I created a related > issue on the bug tracker [2]. I posted documentation for the current > state of the development in the wiki [3]. Very cool! I've started fixing up Cython for it. One thing I noticed: on platforms where wchar_t is signed, the comparison to "128U" in the Py_UNICODE_ISSPACE() macro may issue a warning when applied to a Py_UNICODE value (which it previously was officially defined on). For the sake of portability of existing code, this may be worth a work-around. Personally, I wouldn't really mind getting this warning, given that it's better to use Py_UCS4 instead of Py_UNICODE. But it may turn out to be an annoyance for users, because their code that does this isn't actually broken in the new world. And one thing that I find unfortunate is that we need a new (unexpected) _GET_LENGTH() next to the existing (and obvious) _GET_SIZE(), but I guess that's a somewhat small price to pay for backwards compatibility... 
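To make the warning issue above concrete, a minimal illustration, assuming a platform where wchar_t is a signed type:

    /* Sketch only: with a signed wchar_t (the old Py_UNICODE), comparing
       against an unsigned constant such as 128U can trigger -Wsign-compare,
       even though the code is not actually broken. */
    #include <wchar.h>

    int looks_ascii(wchar_t ch)
    {
        return ch < 128U;   /* signed/unsigned comparison warning here */
    }

Casting the value to Py_UCS4 before feeding it to the macros avoids the warning.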
Stefan From martin at v.loewis.de Tue Aug 23 10:55:40 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Tue, 23 Aug 2011 10:55:40 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <20110823001440.433a0f1f@pitrou.net> References: <20110823001440.433a0f1f@pitrou.net> Message-ID: <4E536B0C.8050008@v.loewis.de> > - ?The UTF-8 decoding fast path for ASCII only characters was removed > and replaced with a memcpy if the entire string is ASCII.? > The fast path would still be useful for mostly-ASCII strings, which > are extremely common (unless UTF-8 has become a no-op?). Is it really extremely common to have strings that are mostly-ASCII but not completely ASCII? I would agree that pure ASCII strings are extremely common. Regards, Martin From stefan_ml at behnel.de Tue Aug 23 11:32:44 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 23 Aug 2011 11:32:44 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E536B0C.8050008@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> Message-ID: "Martin v. L?wis", 23.08.2011 10:55: >> - ?The UTF-8 decoding fast path for ASCII only characters was removed >> and replaced with a memcpy if the entire string is ASCII.? >> The fast path would still be useful for mostly-ASCII strings, which >> are extremely common (unless UTF-8 has become a no-op?). > > Is it really extremely common to have strings that are mostly-ASCII but > not completely ASCII? Maybe not as "extremely common" as pure ASCII strings, but at least for western European languages, "mostly ASCII" strings are very common indeed. Stefan From python-dev at masklinn.net Tue Aug 23 11:46:12 2011 From: python-dev at masklinn.net (Xavier Morel) Date: Tue, 23 Aug 2011 11:46:12 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E536B0C.8050008@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> Message-ID: On 2011-08-23, at 10:55 , Martin v. L?wis wrote: >> - ?The UTF-8 decoding fast path for ASCII only characters was removed >> and replaced with a memcpy if the entire string is ASCII.? >> The fast path would still be useful for mostly-ASCII strings, which >> are extremely common (unless UTF-8 has become a no-op?). > > Is it really extremely common to have strings that are mostly-ASCII but > not completely ASCII? I would agree that pure ASCII strings are > extremely common. Mostly ascii is pretty common for western-european languages (French, for instance, is probably 90 to 95% ascii). It's also a risk in english, when the writer "correctly" spells foreign words (r?sum? and the like). From martin at v.loewis.de Tue Aug 23 12:20:28 2011 From: martin at v.loewis.de (=?windows-1252?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 23 Aug 2011 12:20:28 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> Message-ID: <4E537EEC.1070602@v.loewis.de> Am 23.08.2011 11:46, schrieb Xavier Morel: > On 2011-08-23, at 10:55 , Martin v. L?wis wrote: >>> - ?The UTF-8 decoding fast path for ASCII only characters was removed >>> and replaced with a memcpy if the entire string is ASCII.? >>> The fast path would still be useful for mostly-ASCII strings, which >>> are extremely common (unless UTF-8 has become a no-op?). >> >> Is it really extremely common to have strings that are mostly-ASCII but >> not completely ASCII? 
I would agree that pure ASCII strings are >> extremely common. > Mostly ascii is pretty common for western-european languages (French, for > instance, is probably 90 to 95% ascii). It's also a risk in english, when > the writer "correctly" spells foreign words (r?sum? and the like). I know - I still question whether it is "extremely common" (so much as to justify a special case). I.e. on what application with what dataset would you gain what speedup, at the expense of what amount of extra lines, and potential slow-down for other datasets? For the record, the optimization in question is the one where it masks a long word with 0x80808080L, to see whether it is completely ASCII, and then copies four characters in an unrolled fashion. It stops doing so when it sees a non-ASCII character, and returns to that mode when it gets to the next aligned memory address that stores only ASCII characters. In the PEP 393 approach, if the string has a two-byte representation, each character needs to widened to two bytes, and likewise for four bytes. So three separate copies of the unrolled loop would be needed, one for each target size. Regards, Martin From solipsis at pitrou.net Tue Aug 23 13:39:02 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 23 Aug 2011 13:39:02 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E537EEC.1070602@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> Message-ID: <1314099542.3485.10.camel@localhost.localdomain> > >> Is it really extremely common to have strings that are mostly-ASCII but > >> not completely ASCII? I would agree that pure ASCII strings are > >> extremely common. > > Mostly ascii is pretty common for western-european languages (French, for > > instance, is probably 90 to 95% ascii). It's also a risk in english, when > > the writer "correctly" spells foreign words (r?sum? and the like). > > I know - I still question whether it is "extremely common" (so much as > to justify a special case). Well, it's: - all natural languages based on a variant of the latin alphabet - but also, XML, JSON, HTML documents... - and log files... - in short, any kind of parsable format which is structurally ASCII but and can contain arbitrary unicode So I would say *most* unicode data out there is mostly-ASCII, even when it has Japanese characters in it. The rationale is that most unicode data processed by computers is structured. This optimization was done when trying to improve the speed of text I/O. > In the PEP 393 approach, if the string has a two-byte representation, > each character needs to widened to two bytes, and likewise for four > bytes. So three separate copies of the unrolled loop would be needed, > one for each target size. Do you have three copies of the UTF-8 decoder already, or do you a use a stringlib-like approach? Regards Antoine. From martin at v.loewis.de Tue Aug 23 13:51:58 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Tue, 23 Aug 2011 13:51:58 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <1314099542.3485.10.camel@localhost.localdomain> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> Message-ID: <4E53945E.1050102@v.loewis.de> > This optimization was done when trying to improve the speed of text I/O. So what speedup did it achieve, for the kind of data you talked about? 
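For reference, the fast path being discussed is roughly of this shape (a simplified sketch, not the actual code in Objects/unicodeobject.c):

    /* Test a machine word at a time for high bits; only if the whole word
       is ASCII can the bytes be copied/widened without full UTF-8 decoding. */
    #include <stdint.h>
    #include <string.h>

    static int
    word_chunk_is_ascii(const unsigned char *p, size_t n)
    {
        size_t i = 0;
        uint32_t w;

        for (; i + 4 <= n; i += 4) {
            memcpy(&w, p + i, 4);        /* sidestep alignment issues */
            if (w & 0x80808080U)
                return 0;                /* some byte has its high bit set */
        }
        for (; i < n; i++)
            if (p[i] & 0x80)
                return 0;
        return 1;
    }

With a single Py_UNICODE target the ASCII bytes can then be widened in one unrolled loop; with PEP 393 the widening target depends on the kind, which is where the three variants mentioned above would come from.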
> Do you have three copies of the UTF-8 decoder already, or do you a use a > stringlib-like approach? It's a single implementation - see for yourself. Regards, Martin From stefan_ml at behnel.de Tue Aug 23 14:14:39 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 23 Aug 2011 14:14:39 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: Message-ID: Torsten Becker, 22.08.2011 20:58: > I have implemented an initial version of PEP 393 -- "Flexible String > Representation" as part of my Google Summer of Code project. My patch > is hosted as a repository on bitbucket [1] and I created a related > issue on the bug tracker [2]. I posted documentation for the current > state of the development in the wiki [3]. One thing that occurred to me regarding the object struct: typedef struct { PyObject_HEAD Py_ssize_t length; /* Number of code points in the string */ void *str; /* Canonical, smallest-form Unicode buffer */ Py_hash_t hash; /* Hash value; -1 if not set */ int state; /* != 0 if interned. In this case the two * references from the dictionary to this * object are *not* counted in ob_refcnt. * See SSTATE_KIND_* for other bits */ Py_ssize_t utf8_length; /* Number of bytes in utf8, excluding the * terminating \0. */ char *utf8; /* UTF-8 representation (null-terminated) */ Py_ssize_t wstr_length; /* Number of code points in wstr, possible * surrogates count as two code points. */ wchar_t *wstr; /* wchar_t representation (null-terminated) */ } PyUnicodeObject; Wouldn't the "normal" approach be to use a union for the str field? I.e. union str { unsigned char* latin1; Py_UCS2* ucs2; Py_UCS4* ucs4; } Given that they're all pointers, all fields have the same size, but I find it more readable to write u.str.latin1 than ((const unsigned char*)u.str) Plus, the three types would be given by the struct, rather than by a per-usage cast. Has this been considered before? Was there a reason to decide against it? Stefan From solipsis at pitrou.net Tue Aug 23 14:15:45 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 23 Aug 2011 14:15:45 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E53945E.1050102@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> Message-ID: <1314101745.3485.18.camel@localhost.localdomain> Le mardi 23 ao?t 2011 ? 13:51 +0200, "Martin v. L?wis" a ?crit : > > This optimization was done when trying to improve the speed of text I/O. > > So what speedup did it achieve, for the kind of data you talked about? Since I don't have the number anymore, I've just saved the contents of https://linuxfr.org/news/le-noyau-linux-est-disponible-en-version%C2%A030 as a "linuxfr.html" file and then did: $ ./python -m timeit "with open('linuxfr.html', encoding='utf8') as f: f.read()" 1000 loops, best of 3: 859 usec per loop After disabling the fast path, I ran the micro-benchmark again: $ ./python -m timeit "with open('linuxfr.html', encoding='utf8') as f: f.read()" 1000 loops, best of 3: 1.09 msec per loop so that's a 20% speedup. > > Do you have three copies of the UTF-8 decoder already, or do you a use a > > stringlib-like approach? > > It's a single implementation - see for yourself. So why would you need three separate implementation of the unrolled loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR. 
Even without taking into account the unrolled loop, I wonder how much slower UTF-8 decoding becomes with that approach, by the way. Instead of testing the "kind" variable at each loop iteration, using a stringlib-like approach may be a better deal IMO. Of course we would first need to have various benchmark numbers once the current PEP 393 implementation is complete. Regards Antoine. From martin at v.loewis.de Tue Aug 23 15:06:25 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Tue, 23 Aug 2011 15:06:25 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <1314101745.3485.18.camel@localhost.localdomain> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> Message-ID: <4E53A5D1.2040808@v.loewis.de> > So why would you need three separate implementation of the unrolled > loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR. Depending on where the speedup comes from in this optimization, it may well be that the overhead of figuring out where to store the result eats the gain from the fast test. > Even without taking into account the unrolled loop, I wonder how much > slower UTF-8 decoding becomes with that approach, by the way. In some cases, tests show that it gets faster, overall, compared to 3.2. This is probably because strings take less memory, which means less copying, more cache locality, etc. Of course, it still may be possible to apply micro-optimizations to the new implementation. > Instead of > testing the "kind" variable at each loop iteration, using a > stringlib-like approach may be a better deal IMO. Well, things have to be done in order: 1. the PEP needs to be approved 2. the performance bottlenecks need to be identified 3. optimizations should be applied. I'm not sure what you mean by "stringlib-like" approach - if you are talking about templating, I'd rather avoid this for maintainability reasons, unless significant improvements can be demonstrated. Torsten had a version that used macros for that, and it was a pain to debug. So we put correctness and readability first. Regards, Martin From martin at v.loewis.de Tue Aug 23 15:17:46 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 23 Aug 2011 15:17:46 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: Message-ID: <4E53A87A.1070306@v.loewis.de> > Has this been considered before? Was there a reason to decide against it? I think we simply didn't consider it. An early version of the PEP used the lower bits for the pointer to encode the kind, in which case it even stopped being a pointer. Modules are not expected to access this pointer except through the macros, so it may not matter that much. OTOH, it's certainly not too late to change it. 
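The general shape of that earlier "kind in the low pointer bits" encoding is something like the following (illustrative only; the current draft keeps the kind in the state field):

    /* Illustrative only: pack a 2-bit kind into the low bits of a
       suitably aligned buffer pointer, and mask it off before use. */
    #include <stdint.h>

    #define KIND_MASK ((uintptr_t)0x3)

    static void *str_buffer(void *tagged) { return (void *)((uintptr_t)tagged & ~KIND_MASK); }
    static int   str_kind(void *tagged)   { return (int)((uintptr_t)tagged & KIND_MASK); }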
Regards, Martin From solipsis at pitrou.net Tue Aug 23 15:20:38 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 23 Aug 2011 15:20:38 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E53A5D1.2040808@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> Message-ID: <1314105638.3485.23.camel@localhost.localdomain> > Well, things have to be done in order: > 1. the PEP needs to be approved > 2. the performance bottlenecks need to be identified > 3. optimizations should be applied. Sure, but the whole point of the PEP is to improve performance (I am dumping "memory consumption" in the "performance" bucket). That is, I suppose it will get approval based on its demonstrated benefits. > I'm not sure what you mean by "stringlib-like" approach - if you are > talking about templating, I'd rather avoid this for maintainability > reasons, unless significant improvements can be demonstrated. Torsten > had a version that used macros for that, and it was a pain to debug. The point of templating is precisely to avoid macros, so that the code is natural to read and write and the compiler gives you the right line number when it finds an error. Regards Antoine. From victor.stinner at haypocalc.com Tue Aug 23 15:21:20 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Tue, 23 Aug 2011 15:21:20 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E53A5D1.2040808@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> Message-ID: <4E53A950.30005@haypocalc.com> Le 23/08/2011 15:06, "Martin v. L?wis" a ?crit : > Well, things have to be done in order: > 1. the PEP needs to be approved > 2. the performance bottlenecks need to be identified > 3. optimizations should be applied. I would not vote for the PEP if it slows down Python, especially if it's much slower. But Torsten says that it speeds up Python, which is surprising. I have to do my own benchmarks :-) Victor From stefan_ml at behnel.de Tue Aug 23 16:02:54 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 23 Aug 2011 16:02:54 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E53A87A.1070306@v.loewis.de> References: <4E53A87A.1070306@v.loewis.de> Message-ID: "Martin v. L?wis", 23.08.2011 15:17: >> Has this been considered before? Was there a reason to decide against it? > > I think we simply didn't consider it. An early version of the PEP used > the lower bits for the pointer to encode the kind, in which case it even > stopped being a pointer. Modules are not expected to access this > pointer except through the macros, so it may not matter that much. The difference is that you *could* access them directly in a safe way, if it was a union. So, for an efficient character loop, replicated for performance reasons or for character range handling reasons or whatever, you could just check the string kind and then jump to the loop implementation that handles that type, without using any further macros. 
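Roughly like this, using the union field names from the earlier proposal (a sketch only; counting ASCII spaces stands in for any per-character operation):

    /* Dispatch once on the kind, then run a tight loop over the correctly
       typed buffer. Assumes the 'str' field is the proposed union. */
    static Py_ssize_t
    count_spaces_typed(PyUnicodeObject *u)
    {
        Py_ssize_t i, count = 0;

        switch (PyUnicode_KIND(u)) {
        case PyUnicode_1BYTE_KIND: {
            const unsigned char *s = u->str.latin1;
            for (i = 0; i < u->length; i++)
                count += (s[i] == 0x20);
            break;
        }
        case PyUnicode_2BYTE_KIND: {
            const Py_UCS2 *s = u->str.ucs2;
            for (i = 0; i < u->length; i++)
                count += (s[i] == 0x20);
            break;
        }
        case PyUnicode_4BYTE_KIND: {
            const Py_UCS4 *s = u->str.ucs4;
            for (i = 0; i < u->length; i++)
                count += (s[i] == 0x20);
            break;
        }
        }
        return count;
    }

Each branch is self-documenting, and the only macro left is the one reading the kind.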
Stefan From ncoghlan at gmail.com Tue Aug 23 16:05:14 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 24 Aug 2011 00:05:14 +1000 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E53A87A.1070306@v.loewis.de> References: <4E53A87A.1070306@v.loewis.de> Message-ID: On Tue, Aug 23, 2011 at 11:17 PM, "Martin v. L?wis" wrote: >> Has this been considered before? Was there a reason to decide against it? > > I think we simply didn't consider it. An early version of the PEP used > the lower bits for the pointer to encode the kind, in which case it even > stopped being a pointer. Modules are not expected to access this > pointer except through the macros, so it may not matter that much. > > OTOH, it's certainly not too late to change it. It would make the macro implementations a bit clearer, so +1 for the union approach from me. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From solipsis at pitrou.net Tue Aug 23 16:08:20 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 23 Aug 2011 16:08:20 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project References: <4E53A87A.1070306@v.loewis.de> Message-ID: <20110823160820.08754ffe@pitrou.net> On Tue, 23 Aug 2011 16:02:54 +0200 Stefan Behnel wrote: > "Martin v. L?wis", 23.08.2011 15:17: > >> Has this been considered before? Was there a reason to decide against it? > > > > I think we simply didn't consider it. An early version of the PEP used > > the lower bits for the pointer to encode the kind, in which case it even > > stopped being a pointer. Modules are not expected to access this > > pointer except through the macros, so it may not matter that much. > > The difference is that you *could* access them directly in a safe way, if > it was a union. > > So, for an efficient character loop, replicated for performance reasons or > for character range handling reasons or whatever, you could just check the > string kind and then jump to the loop implementation that handles that > type, without using any further macros. Macros are useful to shield the abstraction from the implementation. If you access the members directly, and the unicode object is represented differently in some future version of Python (say e.g. with tagged pointers), your code doesn't compile anymore. Regards Antoine. From ncoghlan at gmail.com Tue Aug 23 16:13:11 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 24 Aug 2011 00:13:11 +1000 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E53A950.30005@haypocalc.com> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> Message-ID: On Tue, Aug 23, 2011 at 11:21 PM, Victor Stinner wrote: > Le 23/08/2011 15:06, "Martin v. L?wis" a ?crit : >> >> Well, things have to be done in order: >> 1. the PEP needs to be approved >> 2. the performance bottlenecks need to be identified >> 3. optimizations should be applied. > > I would not vote for the PEP if it slows down Python, especially if it's > much slower. But Torsten says that it speeds up Python, which is surprising. > I have to do my own benchmarks :-) As Martin noted, cache misses hurt performance so much on modern processors that making things use less memory overall can actually be a speed optimisation as well. 
Guessing where the remaining bottlenecks are is unlikely to be effective - profiling of the preliminary implementation will be needed. However, the idea that reducing the size of pure ASCII strings (which include all the identifiers in most code) by a factor of 2 or 4 (or so) results in a net speed increase definitely sounds plausible to me, even for non-string processing code. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From stefan_ml at behnel.de Tue Aug 23 17:18:13 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 23 Aug 2011 17:18:13 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <20110823160820.08754ffe@pitrou.net> References: <4E53A87A.1070306@v.loewis.de> <20110823160820.08754ffe@pitrou.net> Message-ID: Antoine Pitrou, 23.08.2011 16:08: > On Tue, 23 Aug 2011 16:02:54 +0200 > Stefan Behnel wrote: >> "Martin v. L?wis", 23.08.2011 15:17: >>>> Has this been considered before? Was there a reason to decide against it? >>> >>> I think we simply didn't consider it. An early version of the PEP used >>> the lower bits for the pointer to encode the kind, in which case it even >>> stopped being a pointer. Modules are not expected to access this >>> pointer except through the macros, so it may not matter that much. >> >> The difference is that you *could* access them directly in a safe way, if >> it was a union. >> >> So, for an efficient character loop, replicated for performance reasons or >> for character range handling reasons or whatever, you could just check the >> string kind and then jump to the loop implementation that handles that >> type, without using any further macros. > > Macros are useful to shield the abstraction from the implementation. If > you access the members directly, and the unicode object is represented > differently in some future version of Python (say e.g. with tagged > pointers), your code doesn't compile anymore. Even with tagged pointers, you could just provide a macro that unpacks the pointer to the buffer for a given string kind. I don't think there's much more to be done to keep up the abstraction. I don't see a reason to prevent users from accessing the memory buffer directly, especially not by (accidental, as I understand it) obfuscation through a void*. Stefan From martin at v.loewis.de Tue Aug 23 18:12:32 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Tue, 23 Aug 2011 18:12:32 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <4E53A87A.1070306@v.loewis.de> <20110823160820.08754ffe@pitrou.net> Message-ID: <4E53D170.1030404@v.loewis.de> > Even with tagged pointers, you could just provide a macro that unpacks > the pointer to the buffer for a given string kind. These macros are indeed available. > I don't think there's > much more to be done to keep up the abstraction. I don't see a reason to > prevent users from accessing the memory buffer directly, especially not > by (accidental, as I understand it) obfuscation through a void*. It's not about preventing them from accessing the representation. It's an "internal public" structure just as all other object layouts (i.e. feel free to use them, but expect them to change with the next release). However, I still think that people rarely will: - most code treats strings as opaque, just as any other PyObject* - code that is aware of strings typically wants them in an encoded form, often UTF-8, or whatever the underlying C library expects. 
- code that does need to look at individual characters should be fine with the accessor macros. That said, I can readily believe that Cython would have a use for direct access to the structure. I just wouldn't want people to rewrite their code in four versions (three for the different 3.3 representations, plus one for 3.2 and earlier). Regards, Martin From nir at winpdb.org Tue Aug 23 20:02:25 2011 From: nir at winpdb.org (Nir Aides) Date: Tue, 23 Aug 2011 21:02:25 +0300 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" Message-ID: Hi all, Please consider this invitation to stick your head into an interesting problem: http://bugs.python.org/issue6721 Nir -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Tue Aug 23 20:20:04 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 23 Aug 2011 20:20:04 +0200 Subject: [Python-Dev] FileSystemError or FilesystemError? Message-ID: <20110823202004.0bb63490@pitrou.net> Hello, When reviewing the PEP 3151 implementation (*), Ezio commented that "FileSystemError" looks a bit strange and that "FilesystemError" would be a better spelling. What is your opinion? (*) http://bugs.python.org/issue12555 Thank you Antoine. From sandro.tosi at gmail.com Tue Aug 23 20:30:50 2011 From: sandro.tosi at gmail.com (Sandro Tosi) Date: Tue, 23 Aug 2011 20:30:50 +0200 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <20110823202004.0bb63490@pitrou.net> References: <20110823202004.0bb63490@pitrou.net> Message-ID: On Tue, Aug 23, 2011 at 20:20, Antoine Pitrou wrote: > When reviewing the PEP 3151 implementation (*), Ezio commented that > "FileSystemError" looks a bit strange and that "FilesystemError" would > be a better spelling. What is your opinion? FilesystemError. Cheers, -- Sandro Tosi (aka morph, morpheus, matrixhasu) My website: http://matrixhasu.altervista.org/ Me at Debian: http://wiki.debian.org/SandroTosi From rosslagerwall at gmail.com Tue Aug 23 20:39:36 2011 From: rosslagerwall at gmail.com (Ross Lagerwall) Date: Tue, 23 Aug 2011 20:39:36 +0200 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <20110823202004.0bb63490@pitrou.net> References: <20110823202004.0bb63490@pitrou.net> Message-ID: <1314124776.1538.2.camel@hobo> > When reviewing the PEP 3151 implementation (*), Ezio commented that > "FileSystemError" looks a bit strange and that "FilesystemError" would > be a better spelling. What is your opinion? I don't think it really matters since both "file system" and "filesystem" appear to be in common usage. I would say +1 to "FileSystemError" -- i.e. take file system as two words. Cheers Ross From cf.natali at gmail.com Tue Aug 23 20:43:25 2011 From: cf.natali at gmail.com (=?ISO-8859-1?Q?Charles=2DFran=E7ois_Natali?=) Date: Tue, 23 Aug 2011 20:43:25 +0200 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: References: Message-ID: 2011/8/23, Nir Aides : > Hi all, Hello Nir, > Please consider this invitation to stick your head into an interesting > problem: > http://bugs.python.org/issue6721 Just for the record, I'm now in favor of the atfork mechanism. It won't solve the problem for I/O locks, but it'll at least make room for a clean and cross-library way to setup atfork handlers. I just skimmed over it, but it seemed Gregory's atfork module could be a good starting point. 
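For reference, the pattern such a mechanism would expose mirrors POSIX pthread_atfork() at the C level (sketch; a Python-level registry would take callables instead of function pointers):

    /* Classic pthread_atfork() usage for one lock: acquire it before
       fork() and release it in both the parent and the child afterwards,
       so the child never inherits a lock held by another thread. */
    #include <pthread.h>

    static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;

    static void before_fork(void)       { pthread_mutex_lock(&io_lock); }
    static void after_fork_parent(void) { pthread_mutex_unlock(&io_lock); }
    static void after_fork_child(void)  { pthread_mutex_unlock(&io_lock); }

    static void
    install_fork_handlers(void)
    {
        pthread_atfork(before_fork, after_fork_parent, after_fork_child);
    }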
cf From nadeem.vawda at gmail.com Tue Aug 23 20:43:45 2011 From: nadeem.vawda at gmail.com (Nadeem Vawda) Date: Tue, 23 Aug 2011 20:43:45 +0200 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <1314124776.1538.2.camel@hobo> References: <20110823202004.0bb63490@pitrou.net> <1314124776.1538.2.camel@hobo> Message-ID: On Tue, Aug 23, 2011 at 8:39 PM, Ross Lagerwall wrote: >> When reviewing the PEP 3151 implementation (*), Ezio commented that >> "FileSystemError" looks a bit strange and that "FilesystemError" would >> be a better spelling. What is your opinion? I think "FilesystemError" looks nicer, but it's not something I'd lose sleep over either way. Cheers, Nadeem From brian.curtin at gmail.com Tue Aug 23 20:46:09 2011 From: brian.curtin at gmail.com (Brian Curtin) Date: Tue, 23 Aug 2011 13:46:09 -0500 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <20110823202004.0bb63490@pitrou.net> References: <20110823202004.0bb63490@pitrou.net> Message-ID: On Tue, Aug 23, 2011 at 13:20, Antoine Pitrou wrote: > > Hello, > > When reviewing the PEP 3151 implementation (*), Ezio commented that > "FileSystemError" looks a bit strange and that "FilesystemError" would > be a better spelling. What is your opinion? > > (*) http://bugs.python.org/issue12555 > > Thank you > > Antoine. I don't care all that much but I'm reminded of the .NET FileSystemWatcher class, so put me down for +0.5 on FileSystemError. -------------- next part -------------- An HTML attachment was scrubbed... URL: From barry at python.org Tue Aug 23 20:46:40 2011 From: barry at python.org (Barry Warsaw) Date: Tue, 23 Aug 2011 14:46:40 -0400 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <1314124776.1538.2.camel@hobo> References: <20110823202004.0bb63490@pitrou.net> <1314124776.1538.2.camel@hobo> Message-ID: <20110823144640.3aad9853@resist.wooz.org> On Aug 23, 2011, at 08:39 PM, Ross Lagerwall wrote: >> When reviewing the PEP 3151 implementation (*), Ezio commented that >> "FileSystemError" looks a bit strange and that "FilesystemError" would >> be a better spelling. What is your opinion? > >I don't think it really matters since both "file system" and >"filesystem" appear to be in common usage. > >I would say +1 to "FileSystemError" -- i.e. take file system as two >words. My online dictionaries prefer "file system" to be two words, so for me, FileSystemError is preferred. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From solipsis at pitrou.net Tue Aug 23 20:51:47 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 23 Aug 2011 20:51:47 +0200 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" References: Message-ID: <20110823205147.3349eaa8@pitrou.net> On Tue, 23 Aug 2011 20:43:25 +0200 Charles-Fran?ois Natali wrote: > > Please consider this invitation to stick your head into an interesting > > problem: > > http://bugs.python.org/issue6721 > > Just for the record, I'm now in favor of the atfork mechanism. It > won't solve the problem for I/O locks, but it'll at least make room > for a clean and cross-library way to setup atfork handlers. I just > skimmed over it, but it seemed Gregory's atfork module could be a good > starting point. Well, I would consider the I/O locks the most glaring problem. Right now, your program can freeze if you happen to do a fork() while e.g. 
the stderr lock is taken by another thread (which is quite common when debugging). Regards Antoine. From ethan at stoneleaf.us Tue Aug 23 21:11:54 2011 From: ethan at stoneleaf.us (Ethan Furman) Date: Tue, 23 Aug 2011 12:11:54 -0700 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <20110823202004.0bb63490@pitrou.net> References: <20110823202004.0bb63490@pitrou.net> Message-ID: <4E53FB7A.5030506@stoneleaf.us> Antoine Pitrou wrote: > Hello, > > When reviewing the PEP 3151 implementation (*), Ezio commented that > "FileSystemError" looks a bit strange and that "FilesystemError" would > be a better spelling. What is your opinion? FileSystemError From stefan-usenet at bytereef.org Tue Aug 23 21:06:05 2011 From: stefan-usenet at bytereef.org (Stefan Krah) Date: Tue, 23 Aug 2011 21:06:05 +0200 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <20110823144640.3aad9853@resist.wooz.org> References: <20110823202004.0bb63490@pitrou.net> <1314124776.1538.2.camel@hobo> <20110823144640.3aad9853@resist.wooz.org> Message-ID: <20110823190605.GA16790@sleipnir.bytereef.org> Barry Warsaw wrote: > My online dictionaries prefer "file system" to be two words, so for me, > FileSystemError is preferred. +1 Stefan Krah From steve at pearwood.info Tue Aug 23 21:19:23 2011 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 24 Aug 2011 05:19:23 +1000 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <20110823202004.0bb63490@pitrou.net> References: <20110823202004.0bb63490@pitrou.net> Message-ID: <4E53FD3B.7000705@pearwood.info> Antoine Pitrou wrote: > Hello, > > When reviewing the PEP 3151 implementation (*), Ezio commented that > "FileSystemError" looks a bit strange and that "FilesystemError" would > be a better spelling. What is your opinion? It's a file system (two words), not filesystem (not in any dictionary or spell checker I've ever used). (Nor do we write filingsystem, governmentsystem, politicalsystem or schoolsystem. This is English, not German.) -- Steven From neologix at free.fr Tue Aug 23 22:07:22 2011 From: neologix at free.fr (=?ISO-8859-1?Q?Charles=2DFran=E7ois_Natali?=) Date: Tue, 23 Aug 2011 22:07:22 +0200 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: <20110823205147.3349eaa8@pitrou.net> References: <20110823205147.3349eaa8@pitrou.net> Message-ID: 2011/8/23 Antoine Pitrou : > Well, I would consider the I/O locks the most glaring problem. Right > now, your program can freeze if you happen to do a fork() while e.g. > the stderr lock is taken by another thread (which is quite common when > debugging). Indeed. To solve this, a similar mechanism could be used: after fork(), in the child process: - just reset each I/O lock (destroy/re-create the lock) if we can guarantee that the file object is in a consistent state (i.e. that all the invariants hold). That's the approach I used in my initial patch. - call a fileobject method which resets the I/O lock and sets the file object to a consistent state (in other word, an atfork handler) From vinay_sajip at yahoo.co.uk Tue Aug 23 22:11:21 2011 From: vinay_sajip at yahoo.co.uk (Vinay Sajip) Date: Tue, 23 Aug 2011 20:11:21 +0000 (UTC) Subject: [Python-Dev] FileSystemError or FilesystemError? 
References: <20110823202004.0bb63490@pitrou.net> Message-ID: Antoine Pitrou pitrou.net> writes: > When reviewing the PEP 3151 implementation (*), Ezio commented that > "FileSystemError" looks a bit strange and that "FilesystemError" would > be a better spelling. What is your opinion? +1 for FileSystemError as I, like others, don't regard "filesystem" as a proper word. Regards, Vinay Sajip From _ at lvh.cc Tue Aug 23 22:17:34 2011 From: _ at lvh.cc (Laurens Van Houtven) Date: Tue, 23 Aug 2011 22:17:34 +0200 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <20110823144640.3aad9853@resist.wooz.org> References: <20110823202004.0bb63490@pitrou.net> <1314124776.1538.2.camel@hobo> <20110823144640.3aad9853@resist.wooz.org> Message-ID: On Tue, Aug 23, 2011 at 8:46 PM, Barry Warsaw wrote: > On Aug 23, 2011, at 08:39 PM, Ross Lagerwall wrote: > > >> When reviewing the PEP 3151 implementation (*), Ezio commented that > >> "FileSystemError" looks a bit strange and that "FilesystemError" would > >> be a better spelling. What is your opinion? > > > >I don't think it really matters since both "file system" and > >"filesystem" appear to be in common usage. > > > >I would say +1 to "FileSystemError" -- i.e. take file system as two > >words. > > My online dictionaries prefer "file system" to be two words, so for me, > FileSystemError is preferred. > > -Barry > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/_%40lvh.cc > > +1 -- cheers lvh -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Tue Aug 23 22:29:22 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 23 Aug 2011 22:29:22 +0200 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: References: <20110823205147.3349eaa8@pitrou.net> Message-ID: <1314131362.3485.36.camel@localhost.localdomain> Le mardi 23 ao?t 2011 ? 22:07 +0200, Charles-Fran?ois Natali a ?crit : > 2011/8/23 Antoine Pitrou : > > Well, I would consider the I/O locks the most glaring problem. Right > > now, your program can freeze if you happen to do a fork() while e.g. > > the stderr lock is taken by another thread (which is quite common when > > debugging). > > Indeed. > To solve this, a similar mechanism could be used: after fork(), in the > child process: > - just reset each I/O lock (destroy/re-create the lock) if we can > guarantee that the file object is in a consistent state (i.e. that all > the invariants hold). That's the approach I used in my initial patch. For I/O locks I think that would work. There could also be a process-wide "fork lock" to serialize locks and other operations, if we want 100% guaranteed consistency of I/O objects across forks. > - call a fileobject method which resets the I/O lock and sets the file > object to a consistent state (in other word, an atfork handler) I fear that the complication with atfork handlers is that you have to manage their lifecycle as well (i.e., when an IO object is destroyed, you have to unregister the handler). Regards Antoine. From barry at python.org Tue Aug 23 23:03:57 2011 From: barry at python.org (Barry Warsaw) Date: Tue, 23 Aug 2011 17:03:57 -0400 Subject: [Python-Dev] PEP 3151 from the BDFOP Message-ID: <20110823170357.3b3ab2fc@resist.wooz.org> I am sending this review as the BDFOP for PEP 3151. 
I've read the PEP and reviewed the python-dev discussion via Gmane. I have not reviewed the hg branch where Antoine has implemented it. I'm not quite ready to pronounce, but I do have some questions and comments. First off, thanks to Antoine for taking this issue on, and for his well written and well reasoned PEP. There's definitely a problem here and I think Python will be better off for having addressed it. I, for one, will be very happy when I can eliminate the majority of `import errno`s from my code. ;) One guiding principle for me is that we should keep the abstraction as thin as possible. In particular, I'm concerned about mapping multiple errnos into a single Error. For example both EPIPE and ESHUTDOWN mapping to BrokePipeError, or EACESS or EPERM to PermissionError. I think we should resist this, so that one errno maps to exactly one Error. Where grouping is desired, Python already has mechanisms to deal with that, e.g. superclasses and multiple inheritance. Therefore, I think it would be better to have + FileSystemPermissionError + AccessError (EACCES) + PermissionError (EPERM) Yes, it makes the hierarchy deeper, and means you have to come up with a few more names, but I think it will also make it easier for the programmer to use and debug. Also, some of the artificial hierarchy introduced in the PEP may not be necessary (e.g. the faux superclass FileSystemPermissionError above). This might lead to the elimination of FileSystemError as some have suggested (I too question its utility). Similarly, I think it would be helpful to have the errno name (e.g. ENOENT) in the error message string. That way, it won't get in the way for most code, but would be usefully printed out for uncaught exceptions. A second guiding principle should be that careful code that works in Python 3.2 must continue to work in Python 3.3 once PEP 3151 is accepted, but also for Python 2 code ported straight to Python 3.3. Given the PEP's emphasis on "useful compatibility", I think this will be the case. Do be prepared for complaints about compatibility for careless code though - there's a ton of that out in the wild, and people will always complain with their "working" code breaks due to an upgrade. Be *very* explicit about this in the release notes and NEWS file, and put your asbestos underoos on. On the plus side, there's not so much Python 3 code to break :). Also, do clearly explain any required migration strategy for existing code, probably in this PEP. Have you considered the impact of this PEP on other Python implementations? My hazy memory of Jython tells me that errnos don't really leak into Java and thus Jython much, but what about PyPy and IronPython? E.g. step 1's deprecation strategy seems pretty CPython-centric. As for step 1 (coalescing the errors). This makes sense and I'm generally agreeable, but I'm wondering whether it's best to re-use IOError for this rather than introduce a new exception. Not that I can think of a good name for that. I'm just not totally convinced that existing code when upgrading to Python 3.3 won't introduce silent failures. If an existing error is to be re-used for this, I'm torn on whether IOError or OSError is a better choice. Popularity aside, OSError *feels* more right. What is the impact of the PEP on tools such as 2to3 and 3to2? Just to be clear, am I right that (on POSIX systems at least) IOError and its subclasses will always have an errno attribute still? And that anything raising an exception (e.g. 
via PyErr_SetFromErrno) other than the new ones will raise IOError? I also think that rather than transforming exception when raised from Python, i.e. via __new__ hackery, perhaps it should be a ValueError in its own right to raise IOError with an error represented by one of the subclasses. Chained exceptions would mean that the original exception needn't get lost. I surveyed some of my own code and observed (as others have) that EISDIR and ENOTDIR are pretty rare. I found more examples of ECHILD and ESRCH than the former two. How'd you like to add those two to make your BDFOP happy? :) What follows are some crazier ideas. I'm just throwing them out there, not necessarily suggesting they should go into the PEP. The new syntax (e.g. if clause on except) is certainly appealing at first glance, and might be of more general use for Python, but I agree with the problems as stated in the PEP. However, there might be a few things that *can* be done to make even the uncommon cases easier. E.g. What if all the errno symbolic names were mapped as attributes on IOError? The only advantage of that would be to eliminate the need to import errno, or for the ugly `e.errno == errno.ENOENT` stuff. That would then be rewritten as `e.errno == IOError.ENOENT`. A mild savings to be sure, but still. How dumb/useless/unworkable would it be to add an __future__ to switch from the old hierarchy to the new one? Probably pretty. ;) What about an api that applications/libraries could use to add additional exceptions based on other errnos they cared about? This could be consulted in PyErr_SetFromErrno() and raised instead of IOError. Okay, yeah, that's probably pretty dumb too. Anyway, that's all I have. I certainly feel like this PEP is pretty close to being accepted. Good work! -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From victor.stinner at haypocalc.com Wed Aug 24 00:27:37 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 00:27:37 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <20110823001440.433a0f1f@pitrou.net> References: <20110823001440.433a0f1f@pitrou.net> Message-ID: <201108240027.37788.victor.stinner@haypocalc.com> Le mardi 23 ao?t 2011 00:14:40, Antoine Pitrou a ?crit : > Hello, > > On Mon, 22 Aug 2011 14:58:51 -0400 > > Torsten Becker wrote: > > I have implemented an initial version of PEP 393 -- "Flexible String > > Representation" as part of my Google Summer of Code project. My patch > > is hosted as a repository on bitbucket [1] and I created a related > > issue on the bug tracker [2]. I posted documentation for the current > > state of the development in the wiki [3]. > > A couple of minor comments: > > - ?The UTF-8 decoding fast path for ASCII only characters was removed > and replaced with a memcpy if the entire string is ASCII.? > The fast path would still be useful for mostly-ASCII strings, which > are extremely common (unless UTF-8 has become a no-op?). 
I posted a patch to re-add it: http://bugs.python.org/issue12819#msg142867 Victor From victor.stinner at haypocalc.com Wed Aug 24 00:38:00 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 00:38:00 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <20110823001440.433a0f1f@pitrou.net> References: <20110823001440.433a0f1f@pitrou.net> Message-ID: <201108240038.00801.victor.stinner@haypocalc.com> Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit : > - You could try to run stringbench, which can be found at > http://svn.python.org/projects/sandbox/trunk/stringbench (*) > and there's iobench (the text mode benchmarks) in the Tools/iobench > directory. Some raw numbers. stringbench: "147.07 203.07 72.4 TOTAL" for the PEP 393 "146.81 140.39 104.6 TOTAL" for default => PEP is 45% slower run test_unicode 50 times: 0m19.487s for PEP 0m17.187s for default => PEP is 13% slower time ./python -m test -j4 ("real" time): 3m16.886s (334 tests) for the PEP 3m21.984s (335 tests) for default ... default has 1 more test! Only 13% slower on test_unicode is *good*. There is still a lot of code using the legacy API in unicode.c, so it can be much better. stringbench only shows the overhead of the conversion from compact unicode to Py_UNICODE* (wchar_t*). stringlib still uses the legacy API. Victor From victor.stinner at haypocalc.com Wed Aug 24 00:46:16 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 00:46:16 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: Message-ID: <201108240046.16058.victor.stinner@haypocalc.com> Le lundi 22 août 2011 20:58:51, Torsten Becker a écrit : > [1]: http://www.python.org/dev/peps/pep-0393 state: lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2 next 2 bits (mask 0x0C) - form of str: 00 => reserved 01 => 1 byte (Latin-1) 10 => 2 byte (UCS-2) 11 => 4 byte (UCS-4); next bit (mask 0x10): 1 if str memory follows PyUnicodeObject kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape(). Victor From victor.stinner at haypocalc.com Wed Aug 24 00:56:48 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 00:56:48 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <201108240046.16058.victor.stinner@haypocalc.com> References: <201108240046.16058.victor.stinner@haypocalc.com> Message-ID: <201108240056.48170.victor.stinner@haypocalc.com> Le mercredi 24 août 2011 00:46:16, Victor Stinner a écrit : > Le lundi 22 août 2011 20:58:51, Torsten Becker a écrit : > > [1]: http://www.python.org/dev/peps/pep-0393 > > state: > lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2 > next 2 bits (mask 0x0C) - form of str: > 00 => reserved > 01 => 1 byte (Latin-1) > 10 => 2 byte (UCS-2) > 11 => 4 byte (UCS-4); > next bit (mask 0x10): 1 if str memory follows PyUnicodeObject > > kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still > necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape(). If it can be removed, it would be nice to have kind in [0; 2] instead of kind in [1; 2], to be able to have a list (of 3 items) => callback or label. I suppose that compilers prefer a switch with all cases defined, 0 as the first item and contiguous values. We may need an enum.
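A small Python-level sketch of how those state bits pack together (an illustration only; the real field lives in the C structure, and the mask names here are invented for the example):

SSTATE_MASK  = 0x03   # lowest 2 bits: interned state, as in 3.2
KIND_MASK    = 0x0C   # next 2 bits: form of str (reserved / 1 / 2 / 4 bytes)
COMPACT_MASK = 0x10   # next bit: str memory follows PyUnicodeObject

def unpack_state(state):
    interned = state & SSTATE_MASK
    kind = (state & KIND_MASK) >> 2
    compact = bool(state & COMPACT_MASK)
    return interned, kind, compact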
Victor From tjreedy at udel.edu Wed Aug 24 01:04:10 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 23 Aug 2011 19:04:10 -0400 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: References: <20110823202004.0bb63490@pitrou.net> Message-ID: On 8/23/2011 2:46 PM, Brian Curtin wrote: > I don't care all that much but I'm reminded of the .NET > FileSystemWatcher class, so put me down for +0.5 on FileSystemError. For other reasons, I am at least +0.5 for FileSystemError also. -- Terry Jan Reedy From solipsis at pitrou.net Wed Aug 24 01:57:56 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 24 Aug 2011 01:57:56 +0200 Subject: [Python-Dev] PEP 3151 from the BDFOP References: <20110823170357.3b3ab2fc@resist.wooz.org> Message-ID: <20110824015756.51cdceac@pitrou.net> Hi, > One guiding principle for me is that we should keep the abstraction as thin as > possible. In particular, I'm concerned about mapping multiple errnos into a > single Error. For example both EPIPE and ESHUTDOWN mapping to BrokePipeError, > or EACESS or EPERM to PermissionError. I think we should resist this, so that > one errno maps to exactly one Error. Where grouping is desired, Python > already has mechanisms to deal with that, e.g. superclasses and multiple > inheritance. Therefore, I think it would be better to have > > + FileSystemPermissionError > + AccessError (EACCES) > + PermissionError (EPERM) I'm not sure that's a good idea: - EPERM is not only about filesystem permissions, see for example http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_timedwait.html - EACCES and EPERM as a low-level distinction makes sense, but from the Python programmer's high-level point of view, the AccessError / PermissionError distinction does not seem to convey any useful meaning. (or perhaps that's just my bad understanding of English) - the "errno" attribute is still there (and still displayed - see below) for people who know their system calls and want to inspect the original error code > Also, some of the artificial hierarchy introduced in the PEP may > not be necessary (e.g. the faux superclass FileSystemPermissionError above). > This might lead to the elimination of FileSystemError as some have suggested > (I too question its utility). Yes, FileSystemError might be removed. I thought that it would be useful, in some library routines, to catch all filesystem-related errors indistinctly, but it's not a complete catchall actually (for example, AccessError is outside of the FileSystemError subtree). > Similarly, I think it would be helpful to have the errno name (e.g. ENOENT) in > the error message string. That way, it won't get in the way for most code, > but would be usefully printed out for uncaught exceptions. Agreed, but I think that's a feature request quite orthogonal to the PEP. The errno *number* is still printed as it was before: >>> open("foo") Traceback (most recent call last): File "", line 1, in FileNotFoundError: [Errno 2] No such file or directory: 'foo' (see e.g. http://bugs.python.org/issue12762) > A second guiding principle should be that careful code that works in Python > 3.2 must continue to work in Python 3.3 once PEP 3151 is accepted, but also > for Python 2 code ported straight to Python 3.3. I don't think porting straight to 3.3 would make a difference, especially now that the idea of deprecating old exception names has been abandoned.
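As a concrete sketch of that compatibility point (an illustration only, not code from the PEP): 3.2-style code that catches IOError and then checks errno keeps working unchanged under the proposed hierarchy, because the new classes are subclasses of the old names and keep the errno attribute.

import errno

try:
    open("/no/such/file")   # assume this path does not exist
except IOError as e:
    # Under PEP 3151 the object caught here is actually the new
    # FileNotFoundError subclass; errno is still set for old-style checks.
    assert e.errno == errno.ENOENT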
> Do be prepared for > complaints about compatibility for careless code though - there's a ton of > that out in the wild, and people will always complain with their "working" > code breaks due to an upgrade. Be *very* explicit about this in the release > notes and NEWS file, and put your asbestos underoos on. I'll take care about that :) > Have you considered the impact of this PEP on other Python implementations? > My hazy memory of Jython tells me that errnos don't really leak into Java and > thus Jython much, but what about PyPy and IronPython? E.g. step 1's > deprecation strategy seems pretty CPython-centric. Alternative implementations already have to implement errno codes in a way or another if they want to have a chance of running existing code. So I don't think the PEP makes much of a difference for them. But their implementors can give their opinion on this. > As for step 1 (coalescing the errors). This makes sense and I'm generally > agreeable, but I'm wondering whether it's best to re-use IOError for this > rather than introduce a new exception. Not that I can think of a good name > for that. I'm just not totally convinced that existing code when upgrading to > Python 3.3 won't introduce silent failures. If an existing error is to be > re-used for this, I'm torn on whether IOError or OSError is a better choice. > Popularity aside, OSError *feels* more right. I don't have any personal preference. Previous discussions seemed to indicate people preferred IOError. But changing the implementation to OSError would be simple. I agree OSError feels slightly more right, as in more generic. > What is the impact of the PEP on tools such as 2to3 and 3to2? I'd say none for 2to3. For 3to2 I'm not sure. Obviously if you write code taking advantage of new features, it will be difficultly back-portable to 2.x. But that's not specific to PEP 3151. Python 3.2 has lot such stuff already: http://docs.python.org/py3k/whatsnew/3.2.html > Just to be clear, am I right that (on POSIX systems at least) IOError and its > subclasses will always have an errno attribute still? Yes! > And that anything > raising an exception (e.g. via PyErr_SetFromErrno) other than the new ones > will raise IOError? I'm not sure I understand the question precisely. The errno mapping mechanism is implemented in IOError.__new__, but it gets called only if the class is exactly IOError, not a subclass: >>> IOError(errno.EPERM, "foo") PermissionError(1, 'foo') >>> class MyIOError(IOError): pass ... >>> MyIOError(errno.EPERM, "foo") MyIOError(1, 'foo') Using IOError.__new__ is the easiest way to ensure that all code raising IO errors takes advantage of the errno mapping. Otherwise you may get APIs raising the proper subclasses, and other APIs always raising base IOError (it doesn't happen often, but some Python library code raises an IOError with an explicit errno). > I also think that rather than transforming exception when raised from Python, > i.e. via __new__ hackery, perhaps it should be a ValueError in its own right > to raise IOError with an error represented by one of the subclasses. That would make it harder to keep compatibility while adding new subclasses in future Python versions. Imagine a lot of people lobby for a dedicated EBADF subclass and obtain it, then IOError(EBADF, "some message") would suddenly raise a ValueError. Or do I misunderstand your proposal? > I surveyed some of my own code and observed (as others have) that EISDIR and > ENOTDIR are pretty rare. Yes, I think they are common in shutil-like code. 
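A rough sketch of that kind of shutil-like code (an illustration only, not taken from shutil or the PEP; it assumes a POSIX system where opening a directory for reading fails with EISDIR): today the intent hides behind an errno test, while a dedicated EISDIR subclass puts it directly in the except clause.

import errno

def read_if_file(path):
    # 3.2-style: catch the broad exception, then inspect errno.
    try:
        with open(path, "rb") as f:
            return f.read()
    except (IOError, OSError) as e:
        if e.errno == errno.EISDIR:
            return None
        raise

def read_if_file_pep3151(path):
    # With the proposed hierarchy, no errno import is needed.
    try:
        with open(path, "rb") as f:
            return f.read()
    except IsADirectoryError:
        return None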
> I found more examples of ECHILD and ESRCH than the > former two. How'd you like to add those two to make your BDFOP happy? :) Wow, I didn't know ESRCH. How would you call the respective exceptions? - ChildProcessError for ECHILD? - ProcessLookupError for ESRCH? > What if all the errno symbolic names were mapped as attributes on IOError? > The only advantage of that would be to eliminate the need to import errno, or > for the ugly `e.errno == errno.ENOENT` stuff. That would then be rewritten as > `e.errno == IOError.ENOENT`. A mild savings to be sure, but still. Hmm, I guess that's explorable as an orthogonal idea. > How dumb/useless/unworkable would it be to add an __future__ to switch from > the old hierarchy to the new one? Probably pretty. ;) Well, the hierarchy is built-in, since it's about standard exceptions. Also, you usually get the exception from some library API, so a __future__ in your own module would not achieve much. > What about an api that applications/libraries could use to add additional > exceptions based on other errnos they cared about? This could be consulted in > PyErr_SetFromErrno() and raised instead of IOError. Okay, yeah, that's > probably pretty dumb too. The problem is that behaviour becomes inconsistent across libraries. I'm not sure that's very helpful to the user. Regards Antoine. From tjreedy at udel.edu Wed Aug 24 02:46:00 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 23 Aug 2011 20:46:00 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E537EEC.1070602@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> Message-ID: On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote: > Am 23.08.2011 11:46, schrieb Xavier Morel: >> Mostly ascii is pretty common for western-european languages (French, for >> instance, is probably 90 to 95% ascii). It's also a risk in english, when >> the writer "correctly" spells foreign words (résumé and the like). > > I know - I still question whether it is "extremely common" (so much as > to justify a special case). I.e. on what application with what dataset > would you gain what speedup, at the expense of what amount of extra > lines, and potential slow-down for other datasets? [snip] > In the PEP 393 approach, if the string has a two-byte representation, > each character needs to be widened to two bytes, and likewise for four > bytes. So three separate copies of the unrolled loop would be needed, > one for each target size. I fully support the declared purpose of the PEP, which I understand to be to have a full, correct Unicode implementation on all new Python releases without paying unnecessary space (and consequent time) penalties. I think the erroneous length, iteration, indexing, and slicing for strings with non-BMP chars in narrow builds needs to be fixed for future versions. I think we should at least consider alternatives to the PEP 393 solution of doubling or quadrupling space if needed for at least one char. In utf16.py, attached to http://bugs.python.org/issue12729 I propose for consideration a prototype of a different solution to the 'mostly BMP chars, few non-BMP chars' case. Rather than expand every character from 2 bytes to 4, attach an array cpdex of character (i.e. code point, not code unit) indexes.
Then for indexing and slicing, the correction is simple, simpler than I first expected: code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) where code-unit-index is the adjusted index into the full underlying double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids most of the space penalty and the consequent time penalty of moving more bytes around and increasing cache misses. I believe the same idea would work for utf8 and the mostly-ascii case. The main difference is that non-ascii chars have various byte sizes rather than the 1 extra double-byte of non-BMP chars in UCS2 builds. So the offset correction would not simply be the bisect-left return but would require another lookup byte-index = char-index + offsets[bisect-left(cpdex, char-index)] If possible, I would have the with-index-array versions be separate subtypes, as in utf16.py. I believe either index-array implementation might benefit from a subtype for single multi-unit chars, as a single non-ASCII or non-BMP char does not need an auxiliary [0] array and a senseless lookup therein but does need its length fixed at 1 instead of the number of base array units. -- Terry Jan Reedy From tjreedy at udel.edu Wed Aug 24 02:46:06 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 23 Aug 2011 20:46:06 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E53A950.30005@haypocalc.com> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> Message-ID: On 8/23/2011 9:21 AM, Victor Stinner wrote: > Le 23/08/2011 15:06, "Martin v. L?wis" a ?crit : >> Well, things have to be done in order: >> 1. the PEP needs to be approved >> 2. the performance bottlenecks need to be identified >> 3. optimizations should be applied. > > I would not vote for the PEP if it slows down Python, especially if it's > much slower. But Torsten says that it speeds up Python, which is > surprising. I have to do my own benchmarks :-) The current UCS2 Unicode string implementation, by design, quickly gives WRONG answers for len(), iteration, indexing, and slicing if a string contains any non-BMP (surrogate pair) Unicode characters. That may have been excusable when there essentially were no such extended chars, and the few there were were almost never used. But now there are many more, with more being added to each Unicode edition. They include cursive Math letters that are used in English documents today. The problem will slowly get worse and Python, at least on Windows, will become a language to avoid for dependable Unicode document processing. 3.x needs a proper Unicode implementation that works for all strings on all builds. utf16.py, attached to http://bugs.python.org/issue12729 prototypes a different solution than the PEP for the above problems for the 'mostly BMP' case. I will discuss it in a different post. 
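Here is a rough, self-contained Python sketch of the cpdex idea and the correction formula above (an illustration only, not the utf16.py prototype itself): cpdex holds the character indexes of the non-BMP characters, and each one occurring before a given character index contributes one extra 16-bit code unit.

import bisect

def build_cpdex(code_units):
    # code_units: a sequence of 16-bit values; a high surrogate
    # (0xD800-0xDBFF) starts a surrogate pair, i.e. one non-BMP character.
    cpdex = []
    char_index = 0
    i = 0
    while i < len(code_units):
        if 0xD800 <= code_units[i] <= 0xDBFF:
            cpdex.append(char_index)
            i += 2          # two code units, one character
        else:
            i += 1
        char_index += 1
    return cpdex

def code_unit_index(char_index, cpdex):
    # Terry's correction: add one extra unit for each earlier non-BMP character.
    return char_index + bisect.bisect_left(cpdex, char_index)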
-- Terry Jan Reedy From torsten.becker at gmail.com Wed Aug 24 04:35:32 2011 From: torsten.becker at gmail.com (Torsten Becker) Date: Tue, 23 Aug 2011 22:35:32 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <20110823001440.433a0f1f@pitrou.net> References: <20110823001440.433a0f1f@pitrou.net> Message-ID: On Mon, Aug 22, 2011 at 18:14, Antoine Pitrou wrote: > - You could trim the debug results from the benchmark results, this may > ?make them more readable. Good point, I removed them from the wiki page. On Tue, Aug 23, 2011 at 18:38, Victor Stinner wrote: > Le mardi 23 ao?t 2011 00:14:40, Antoine Pitrou a ?crit : >> - You could try to run stringbench, which can be found at >> ? http://svn.python.org/projects/sandbox/trunk/stringbench (*) >> ? and there's iobench (the text mode benchmarks) in the Tools/iobench >> ? directory. > > Some raw numbers. > [...] Thank you Victor for running stringbench, I did not get to it in time. Regards, Torsten From ncoghlan at gmail.com Wed Aug 24 04:31:12 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 24 Aug 2011 12:31:12 +1000 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <4E53FD3B.7000705@pearwood.info> References: <20110823202004.0bb63490@pitrou.net> <4E53FD3B.7000705@pearwood.info> Message-ID: On Wed, Aug 24, 2011 at 5:19 AM, Steven D'Aprano wrote: > Antoine Pitrou wrote: >> >> Hello, >> >> When reviewing the PEP 3151 implementation (*), Ezio commented that >> "FileSystemError" looks a bit strange and that "FilesystemError" would >> be a better spelling. What is your opinion? > > It's a file system (two words), not filesystem (not in any dictionary or > spell checker I've ever used). I rarely find spell checkers to be useful sources of data on correct spelling of technical jargon (and the computing usage of the term 'filesystem' definitely qualifies as jargon). > (Nor do we write filingsystem, governmentsystem, politicalsystem or > schoolsystem. This is English, not German.) Personally, I think 'filesystem' is a portmanteau in the process of coming into existence (as evidenced by usage like 'FHS' standing for 'Filesystem Hierarchy Standard'). However, the two word form is still useful at times, particularly for disambiguation of acronyms (as evidenced by usage like 'NFS' and 'GFS' for 'Network File System' and 'Google File System'). The Wikipedia article on the topic mixes and matches the two forms, but overall does favour the two word form. Since I tend to use the one word 'filesystem' form myself (ditto for 'filename'), I'm +1 for FilesystemError, but I'm only -0 for FileSystemError (so I expect that will be the option chosen, given other responses). Regards, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From torsten.becker at gmail.com Wed Aug 24 04:39:49 2011 From: torsten.becker at gmail.com (Torsten Becker) Date: Tue, 23 Aug 2011 22:39:49 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <1314101745.3485.18.camel@localhost.localdomain> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> Message-ID: On Tue, Aug 23, 2011 at 08:15, Antoine Pitrou wrote: > So why would you need three separate implementation of the unrolled > loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR. The WRITE_FLEXIBLE_OR_WSTR macro does a check for kind and then writes. 
Using this macro for the fast path would be inefficient, to have a real fast path, you would need a outer if to check for kind and then in each condition body the matching access to the string (1, 2, or 4 bytes) and for each body also write 4 or 8 times (guarded by #ifdef, depending on platform). As all these cases bloated up the C code, we went for the simple solution with the goal of profiling the code again afterwards to see where the new performance bottlenecks would be. > Even without taking into account the unrolled loop, I wonder how much > slower UTF-8 decoding becomes with that approach, by the way. Instead of > testing the "kind" variable at each loop iteration, using a > stringlib-like approach may be a better deal IMO. To me this feels like this would complicate the C source code and decrease readability. For each function you would need a wrapper which does the kind checking logic and then, in a separate file, the implementation of the function which then gets included three times for each character width. Regards, Torsten From torsten.becker at gmail.com Wed Aug 24 04:41:59 2011 From: torsten.becker at gmail.com (Torsten Becker) Date: Tue, 23 Aug 2011 22:41:59 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <201108240027.37788.victor.stinner@haypocalc.com> References: <20110823001440.433a0f1f@pitrou.net> <201108240027.37788.victor.stinner@haypocalc.com> Message-ID: On Tue, Aug 23, 2011 at 18:27, Victor Stinner wrote: > I posted a patch to re-add it: > http://bugs.python.org/issue12819#msg142867 Thank you for the patch! Note that this patch adds the fast path only to the helper function which determines the length of the string and the maximum character. The decoding part is still without a fast path for ASCII runs. Regards, Torsten From ncoghlan at gmail.com Wed Aug 24 04:42:58 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 24 Aug 2011 12:42:58 +1000 Subject: [Python-Dev] Planned PEP status changes Message-ID: Unless I hear any objections, I plan to adjust the current PEP statuses as follows some time this weekend: Move from Accepted to Finished: 389 argparse - New Command Line Parsing Module Bethard 391 Dictionary-Based Configuration For Logging Sajip 3108 Standard Library Reorganization Cannon 3135 New Super Spealman, Delaney, Ryan Move from Accepted to Withdrawn (with a reference to Reid Kleckner's blog post) 3146 Merging Unladen Swallow into CPython Winter, Yasskin, Kleckner The PEP 3118 enhanced buffer protocol has some ongoing semantic and implementation issues still to be worked out, so I plan to leave that at Accepted. Ditto for PEP 3121 (extension module finalisation), since that doesn't play nicely with the current 'set everything to None' approach to breaking cycles during module finalisation. The other Accepted PEPs are either packaging standards related or genuinely not implemented yet. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From torsten.becker at gmail.com Wed Aug 24 04:41:20 2011 From: torsten.becker at gmail.com (Torsten Becker) Date: Tue, 23 Aug 2011 22:41:20 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <20110823160820.08754ffe@pitrou.net> References: <4E53A87A.1070306@v.loewis.de> <20110823160820.08754ffe@pitrou.net> Message-ID: On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou wrote: > Macros are useful to shield the abstraction from the implementation. 
If > you access the members directly, and the unicode object is represented > differently in some future version of Python (say e.g. with tagged > pointers), your code doesn't compile anymore. I agree with Antoine, from the experience of porting C code from 3.2 to the PEP 393 unicode API, the additional encapsulation by macros made it much easier to change the implementation of what is a field, what is a field's actual name, and what needs to be calculated through a function. So, I would like to keep primary access as a macro but I see the point that it would make the struct clearer to access and I would not mind changing the struct to use a union. But then most access currently is through macros so I am not sure how much benefit the union would bring as it mostly complicates the struct definition. Also, common, now simple, checks for "unicode->str == NULL" would look more ambiguous with a union ("unicode->str.latin1 == NULL"). Regards, Torsten From ncoghlan at gmail.com Wed Aug 24 04:51:29 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 24 Aug 2011 12:51:29 +1000 Subject: [Python-Dev] PEP 3151 from the BDFOP In-Reply-To: <20110824015756.51cdceac@pitrou.net> References: <20110823170357.3b3ab2fc@resist.wooz.org> <20110824015756.51cdceac@pitrou.net> Message-ID: On Wed, Aug 24, 2011 at 9:57 AM, Antoine Pitrou wrote: > I don't have any personal preference. Previous discussions seemed to > indicate people preferred IOError. But changing the implementation to > OSError would be simple. I agree OSError feels slightly more right, as > in more generic. IIRC, the preference for IOError was formed when we were going to deprecate the 'legacy' names. Now that using the old names won't trigger any kind of warning, +1 for using OSError as the official name of the base class with IOError as a legacy alias. >> And that anything >> raising an exception (e.g. via PyErr_SetFromErrno) other than the new ones >> will raise IOError? > > I'm not sure I understand the question precisely. The errno mapping > mechanism is implemented in IOError.__new__, but it gets called only if > the class is exactly IOError, not a subclass: > >>>> IOError(errno.EPERM, "foo") > PermissionError(1, 'foo') >>>> class MyIOError(IOError): pass > ... >>>> MyIOError(errno.EPERM, "foo") > MyIOError(1, 'foo') > > Using IOError.__new__ is the easiest way to ensure that all code > raising IO errors takes advantage of the errno mapping. Otherwise you > may get APIs raising the proper subclasses, and other APIs always > raising base IOError (it doesn't happen often, but some Python > library code raises an IOError with an explicit errno). It's also the natural place to put the errno->exception type mapping so that existing code will raise the new errors without requiring modification. We could spell it as a new class method ("from_errno" or similar), but there isn't any ambiguity in doing it directly in __new__, so a class method seems pointlessly inconvenient. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From torsten.becker at gmail.com Wed Aug 24 04:56:57 2011 From: torsten.becker at gmail.com (Torsten Becker) Date: Tue, 23 Aug 2011 22:56:57 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <201108240056.48170.victor.stinner@haypocalc.com> References: <201108240046.16058.victor.stinner@haypocalc.com> <201108240056.48170.victor.stinner@haypocalc.com> Message-ID: On Tue, Aug 23, 2011 at 18:56, Victor Stinner wrote: >> kind=0 is used and public, it's PyUnicode_WCHAR_KIND. 
Is it still >> necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape(). > > If it can be removed, it would be nice to have kind in [0; 2] instead of kind > in [1; 2], to be able to have a list (of 3 items) => callback or label. It is also used in PyUnicode_DecodeUTF8Stateful() and there might be some cases which I missed converting checks for 0 when I introduced the macro. The question was more if this should be written as 0 or as a named constant. I preferred the named constant for readability. An alternative would be to have kind values be the same as the number of bytes for the string representation so it would be 0 (wstr), 1 (1-byte), 2 (2-byte), or 4 (4-byte). I think the value for wstr/uninitialized/reserved should not be removed. The wstr representation is still used in the error case in the utf8 decoder because these strings can be resized. Also having one designated value for "uninitialized" limits comparisons in the affected functions to the kind value, otherwise they would need to check the str field for NULL to determine in which buffer to write a character. > I suppose that compilers prefer a switch with all cases defined, 0 a first item > and contiguous values. We may need an enum. During the Summer of Code, Martin and I did a experiment with GCC and it did not seem to produce a jump table as an optimization for three cases but generated comparison instructions anyway. I am not sure how much we should optimize for potential compiler optimizations here. Regards, Torsten From scott+python-dev at scottdial.com Wed Aug 24 06:59:26 2011 From: scott+python-dev at scottdial.com (Scott Dial) Date: Wed, 24 Aug 2011 00:59:26 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <201108240038.00801.victor.stinner@haypocalc.com> References: <20110823001440.433a0f1f@pitrou.net> <201108240038.00801.victor.stinner@haypocalc.com> Message-ID: <4E54852E.9000601@scottdial.com> On 8/23/2011 6:38 PM, Victor Stinner wrote: > Le mardi 23 ao?t 2011 00:14:40, Antoine Pitrou a ?crit : >> - You could try to run stringbench, which can be found at >> http://svn.python.org/projects/sandbox/trunk/stringbench (*) >> and there's iobench (the text mode benchmarks) in the Tools/iobench >> directory. > > Some raw numbers. > > stringbench: > "147.07 203.07 72.4 TOTAL" for the PEP 393 > "146.81 140.39 104.6 TOTAL" for default > => PEP is 45% slower I ran the same benchmark and couldn't make a distinction in performance between them: pep-393.txt 182.17 175.47 103.8 TOTAL cpython.txt 183.26 177.97 103.0 TOTAL pep-393-wide-unicode.txt 181.61 198.69 91.4 TOTAL cpython-wide-unicode.txt 181.27 195.58 92.7 TOTAL I ran it a couple times and have seen either default or pep-393 being up to +/- 10 sec slower on the unicode tests. The results of the 8-bit string tests seem to have less variance on my test machine. > run test_unicode 50 times: > 0m19.487s for PEP > 0m17.187s for default > => PEP is 13% slower $ time ./python -m test `python -c 'print "test_unicode " * 50'` pep-393-wide-unicode.txt real 0m33.409s cpython-wide-unicode.txt real 0m33.489s Nothing in it for me.. except your system is obviously faster, in general. -- Scott Dial scott at scottdial.com From ezio.melotti at gmail.com Wed Aug 24 07:39:24 2011 From: ezio.melotti at gmail.com (Ezio Melotti) Date: Wed, 24 Aug 2011 08:39:24 +0300 Subject: [Python-Dev] FileSystemError or FilesystemError? 
In-Reply-To: References: <20110823202004.0bb63490@pitrou.net> <4E53FD3B.7000705@pearwood.info> Message-ID: <4E548E8C.8040701@gmail.com> On 24/08/2011 5.31, Nick Coghlan wrote: > On Wed, Aug 24, 2011 at 5:19 AM, Steven D'Aprano wrote: >> (Nor do we write filingsystem, governmentsystem, politicalsystem or >> schoolsystem. This is English, not German.) > Personally, I think 'filesystem' is a portmanteau in the process of > coming into existence (as evidenced by usage like 'FHS' standing for > 'Filesystem Hierarchy Standard'). However, the two word form is still > useful at times, particularly for disambiguation of acronyms (as > evidenced by usage like 'NFS' and 'GFS' for 'Network File System' and > 'Google File System'). The Wikipedia article on the topic mixes and > matches the two forms, but overall does favour the two word form. > > Since I tend to use the one word 'filesystem' form myself (ditto for > 'filename'), I'm +1 for FilesystemError, but I'm only -0 for > FileSystemError (so I expect that will be the option chosen, given > other responses). This pretty much summarizes my thoughts. I saw the wiki article using both and since I consider 'filesystem' a single word I was wondering if anyone else preferred FilesystemError. I'm totally fine with FileSystemError too though, if most people prefer it. Best Regards, Ezio Melotti > > Regards, > Nick. > From stefan_ml at behnel.de Wed Aug 24 08:57:54 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 24 Aug 2011 08:57:54 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <4E53A87A.1070306@v.loewis.de> <20110823160820.08754ffe@pitrou.net> Message-ID: Torsten Becker, 24.08.2011 04:41: > Also, common, now simple, checks for "unicode->str == NULL" would look > more ambiguous with a union ("unicode->str.latin1 == NULL"). You could just add yet another field "any", i.e. union { unsigned char* latin1; Py_UCS2* ucs2; Py_UCS4* ucs4; void* any; } str; That way, the above test becomes if (!unicode->str.any) or if (unicode->str.any == NULL) Or maybe even call it "initialised" to match the intended purpose: if (!unicode->str.initialised) That being said, I don't mind "unicode->str.latin1 == NULL" either, given that it will (as mentioned by others) be hidden behind a macro most of the time anyway. Stefan From stephen at xemacs.org Wed Aug 24 09:51:40 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 24 Aug 2011 16:51:40 +0900 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: References: <20110823202004.0bb63490@pitrou.net> <4E53FD3B.7000705@pearwood.info> Message-ID: <87sjorb62b.fsf@uwakimon.sk.tsukuba.ac.jp> Nick Coghlan writes: > Since I tend to use the one word 'filesystem' form myself (ditto for > 'filename'), I'm +1 for FilesystemError, but I'm only -0 for > FileSystemError (so I expect that will be the option chosen, given > other responses). I slightly prefer FilesystemError because it parses unambiguously. Cf. FileSystemError vs FileUserError. From v+python at g.nevcal.com Wed Aug 24 09:56:56 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 24 Aug 2011 00:56:56 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> Message-ID: <4E54AEC8.7040702@g.nevcal.com> On 8/23/2011 5:46 PM, Terry Reedy wrote: > On 8/23/2011 6:20 AM, "Martin v. 
L?wis" wrote: >> Am 23.08.2011 11:46, schrieb Xavier Morel: >>> Mostly ascii is pretty common for western-european languages >>> (French, for >>> instance, is probably 90 to 95% ascii). It's also a risk in english, >>> when >>> the writer "correctly" spells foreign words (r?sum? and the like). >> >> I know - I still question whether it is "extremely common" (so much as >> to justify a special case). I.e. on what application with what dataset >> would you gain what speedup, at the expense of what amount of extra >> lines, and potential slow-down for other datasets? > [snip] >> In the PEP 393 approach, if the string has a two-byte representation, >> each character needs to widened to two bytes, and likewise for four >> bytes. So three separate copies of the unrolled loop would be needed, >> one for each target size. > > I fully support the declared purpose of the PEP, which I understand to > be to have a full,correct Unicode implementation on all new Python > releases without paying unnecessary space (and consequent time) > penalties. I think the erroneous length, iteration, indexing, and > slicing for strings with non-BMP chars in narrow builds needs to be > fixed for future versions. I think we should at least consider > alternatives to the PEP393 solution of double or quadrupling space if > needed for at least one char. > > In utf16.py, attached to http://bugs.python.org/issue12729 > I propose for consideration a prototype of different solution to the > 'mostly BMP chars, few non-BMP chars' case. Rather than expand every > character from 2 bytes to 4, attach an array cpdex of character (ie > code point, not code unit) indexes. Then for indexing and slicing, the > correction is simple, simpler than I first expected: > code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) > where code-unit-index is the adjusted index into the full underlying > double-byte array. This adds a time penalty of log2(len(cpdex)), but > avoids most of the space penalty and the consequent time penalty of > moving more bytes around and increasing cache misses. > > I believe the same idea would work for utf8 and the mostly-ascii case. > The main difference is that non-ascii chars have various byte sizes > rather than the 1 extra double-byte of non-BMP chars in UCS2 builds. > So the offset correction would not simply be the bisect-left return > but would require another lookup > byte-index = char-index + offsets[bisect-left(cpdex, char-index)] > > If possible, I would have the with-index-array versions be separate > subtypes, as in utf16.py. I believe either index-array implementation > might benefit from a subtype for single multi-unit chars, as a single > non-ASCII or non-BMP char does not need an auxiliary [0] array and a > senseless lookup therein but does need its length fixed at 1 instead > of the number of base array units. > So am I correctly reading between the lines when, after reading this thread so far, and the complete issue discussion so far, that I see a PEP 393 revision or replacement that has the following characteristics: 1) Narrow builds are dropped. The conceptual idea of PEP 393 eliminates the need for narrow builds, as the internal string data structures adjust to the actuality of the data. If you want a narrow build, just don't use code points > 65535. 2) There are more, or different, internal kinds of strings, which affect the processing patterns. 
Here is an enumeration of the ones I can think of, as complete as possible, with recognition that benchmarking and clever algorithms may eliminate the need for some of them. a) all ASCII b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This kind may not be able to support a "mostly" variation, and may be no more efficient than case b). But it might also be popular in parts of Europe :) And appropriate benchmarks may discover whether or not it has worth. c) mostly ASCII (utf8) with clever indexing/caching to be efficient d) UTF-8 with clever indexing/caching to be efficient e) 16-bit codepoints f) UTF-16 with clever indexing/caching to be efficient g) 32-bit codepoints h) UTF-32 When instantiating a str, a new parameter or subtype would restrict the implementation to using only a), b), d), f), and h) when fully conformant Unicode behavior is desired. No lone surrogates, no out of range code points, no illegal codepoints. A default str would prefer a), b), c), e), and g) for efficiency and flexibility. When manipulations outside of Unicode are necessary [Windows seems to use e) for example, suffering from the same sorts of backward compatibility problems as Python, in some ways], the default str type would permit them, using e) and g) kinds of representations. Although the surrogate escape codec only uses prefix surrogates (or is it only suffix ones?) which would never match up, note that a conversion from 16-bit codepoints to other formats may produce matches between the results of the surrogate escape codec, and other unchecked data introduced by the user/program. A method should be provided to validate and promote a string from default, unchecked str type to the subtype or variation that enforces Unicode, if it qualifies; if it doesn't qualify, an exception would be raised by the method. (This could generally be done in place if the value is bound to only a single variable, but would generate a copy and rebind the variable to the promoted copy if it is multiply referenced?) Another parameter or subtype of the conformant str would add grapheme support, which has a different set of rules for the clever indexing/caching, but could be applied to any of a)†, c)†, d), f), or h). † It is unnecessary to apply clever indexing/caching to a) and c) kinds of string internals, because there is a one-to-one mapping between bytes, codepoints, and graphemes in these ranges. So plain array indexing can be used in the implementation of these kinds. From victor.stinner at haypocalc.com Wed Aug 24 10:10:17 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 10:10:17 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <4E53A87A.1070306@v.loewis.de> <20110823160820.08754ffe@pitrou.net> Message-ID: <4E54B1E9.8080404@haypocalc.com> Le 24/08/2011 04:41, Torsten Becker a écrit : > On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou wrote: >> Macros are useful to shield the abstraction from the implementation. If >> you access the members directly, and the unicode object is represented >> differently in some future version of Python (say e.g. with tagged >> pointers), your code doesn't compile anymore.
> > I agree with Antoine, from the experience of porting C code from 3.2 > to the PEP 393 unicode API, the additional encapsulation by macros > made it much easier to change the implementation of what is a field, > what is a field's actual name, and what needs to be calculated through > a function. A union helps debugging in gdb: you don't have to cast manually to unsigned char*/Py_UCS2*/Py_UCS4*. > Also, common, now simple, checks for "unicode->str == NULL" would look > more ambiguous with a union ("unicode->str.latin1 == NULL"). We can rename "str" to something else, to "data" for example. Victor From victor.stinner at haypocalc.com Wed Aug 24 10:11:50 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 10:11:50 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E54852E.9000601@scottdial.com> References: <20110823001440.433a0f1f@pitrou.net> <201108240038.00801.victor.stinner@haypocalc.com> <4E54852E.9000601@scottdial.com> Message-ID: <4E54B246.5020008@haypocalc.com> Le 24/08/2011 06:59, Scott Dial a écrit : > On 8/23/2011 6:38 PM, Victor Stinner wrote: >> Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit : >>> - You could try to run stringbench, which can be found at >>> http://svn.python.org/projects/sandbox/trunk/stringbench (*) >>> and there's iobench (the text mode benchmarks) in the Tools/iobench >>> directory. >> >> Some raw numbers. >> >> stringbench: >> "147.07 203.07 72.4 TOTAL" for the PEP 393 >> "146.81 140.39 104.6 TOTAL" for default >> => PEP is 45% slower > > I ran the same benchmark and couldn't make a distinction in performance > between them: Hum, are you sure that you used the PEP 393? Make sure that you are using the pep-393 branch! I also started my benchmark on the wrong branch :-) Victor From victor.stinner at haypocalc.com Wed Aug 24 10:17:58 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 10:17:58 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <201108240027.37788.victor.stinner@haypocalc.com> Message-ID: <4E54B3B6.1020205@haypocalc.com> Le 24/08/2011 04:41, Torsten Becker a écrit : > On Tue, Aug 23, 2011 at 18:27, Victor Stinner > wrote: >> I posted a patch to re-add it: >> http://bugs.python.org/issue12819#msg142867 > > Thank you for the patch! Note that this patch adds the fast path only > to the helper function which determines the length of the string and > the maximum character. The decoding part is still without a fast path > for ASCII runs. Ah? If utf8_max_char_size_and_has_errors() returns no error and maxchar=127: memcpy() is used. You mean that memcpy() is too slow? :-) maxchar = utf8_max_char_size_and_has_errors(s, size, &unicode_size, &has_errors); if (has_errors) { ... } else { unicode = (PyUnicodeObject *)PyUnicode_New(unicode_size, maxchar); if (!unicode) return NULL; /* When the string is ASCII only, just use memcpy and return. */ if (maxchar < 128) { assert(unicode_size == size); Py_MEMCPY(PyUnicode_1BYTE_DATA(unicode), s, unicode_size); return (PyObject *)unicode; } ...
} But yes, my patch only optimizes ASCII-only strings, not "mostly-ASCII" strings (e.g. 100 ASCII + 1 latin1 character). It can be optimized later. I didn't benchmark my patch. Victor From martin at v.loewis.de Wed Aug 24 10:18:20 2011 From: martin at v.loewis.de (Martin v. Löwis) Date: Wed, 24 Aug 2011 10:18:20 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E54AEC8.7040702@g.nevcal.com> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <4E54AEC8.7040702@g.nevcal.com> Message-ID: <4E54B3CC.9040900@v.loewis.de> > So am I correctly reading between the lines when, after reading this > thread so far, and the complete issue discussion so far, that I see a > PEP 393 revision or replacement that has the following characteristics: > > 1) Narrow builds are dropped. PEP 393 already drops narrow builds. > 2) There are more, or different, internal kinds of strings, which affect > the processing patterns. This is the basic idea of PEP 393. > a) all ASCII > b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This > kind may not be able to support a "mostly" variation, and may be no more > efficient than case b). But it might also be popular in parts of Europe These two cases are already in PEP 393. > c) mostly ASCII (utf8) with clever indexing/caching to be efficient > d) UTF-8 with clever indexing/caching to be efficient I see neither a need nor a means to consider these. > e) 16-bit codepoints These are in PEP 393. > f) UTF-16 with clever indexing/caching to be efficient Again, -1. > g) 32-bit codepoints This is in PEP 393. > h) UTF-32 What's that, as opposed to g)? I'm not open to revise PEP 393 in the direction of adding more representations. Regards, Martin From turnbull at sk.tsukuba.ac.jp Wed Aug 24 10:22:37 2011 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Wed, 24 Aug 2011 17:22:37 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> Message-ID: <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Terry Reedy writes: > The current UCS2 Unicode string implementation, by design, quickly gives > WRONG answers for len(), iteration, indexing, and slicing if a string > contains any non-BMP (surrogate pair) Unicode characters. That may have > been excusable when there essentially were no such extended chars, and > the few there were were almost never used. Well, no, it gives the right answer according to the design. unicode objects do not contain character strings. By design, they contain code point strings. Guido has made that absolutely clear on a number of occasions. And the reasons have very little to do with lack of non-BMP characters to trip up the implementation. Changing those semantics should have been done before the release of Python 3. It is not clear to me that it is a good idea to try to decide on "the" correct implementation of Unicode strings in Python even today. There are a number of approaches that I can think of. 1. The "too bad if you can't take a joke" approach: do nothing and recommend UTF-32 to those who want len() to DTRT. 2.
The "slope is slippery" approach: Implement UTF-16 objects as built-ins, and then try to fend off requests for correct treatment of unnormalized composed characters, normalization, compatibility substitutions, bidi, etc etc. 3. The "are we not hackers?" approach: Implement a transform that maps characters that are not represented by a single code point into Unicode private space, and then see if anybody really needs more than 6400 non-BMP characters. (Note that this would generalize to composed characters that don't have a one-code-point NFC form and similar non-standardized cases that nonstandard users might want handled.) 4. The "42" approach: sadly, I can't think deeply enough to explain it. There are probably others. It's true that Python is going to need good libraries to provide correct handling of Unicode strings (as opposed to unicode objects). But it's not clear to me given the wide variety of implementations I can imagine that there will be one best implementation, let alone which ones are good and Pythonic, and which not so. From victor.stinner at haypocalc.com Wed Aug 24 10:27:21 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 10:27:21 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <201108240046.16058.victor.stinner@haypocalc.com> <201108240056.48170.victor.stinner@haypocalc.com> Message-ID: <4E54B5E9.3070905@haypocalc.com> Le 24/08/2011 04:56, Torsten Becker a ?crit : > On Tue, Aug 23, 2011 at 18:56, Victor Stinner > wrote: >>> kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still >>> necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape(). >> >> If it can be removed, it would be nice to have kind in [0; 2] instead of kind >> in [1; 2], to be able to have a list (of 3 items) => callback or label. > > It is also used in PyUnicode_DecodeUTF8Stateful() and there might be > some cases which I missed converting checks for 0 when I introduced > the macro. The question was more if this should be written as 0 or as > a named constant. I preferred the named constant for readability. > > An alternative would be to have kind values be the same as the number > of bytes for the string representation so it would be 0 (wstr), 1 > (1-byte), 2 (2-byte), or 4 (4-byte). Please don't do that: it's more common to need contiguous arrays (for a jump table/callback list) than having to know the character size. You can use an array giving the character size: CHARACTER_SIZE[kind] which is the array {0, 1, 2, 4} (or maybe sizeof(wchar_t) instead of 0 ?). > I think the value for wstr/uninitialized/reserved should not be > removed. The wstr representation is still used in the error case in > the utf8 decoder because these strings can be resized. In Python, you can resize an object if it has only one reference. Why is it not possible in your branch? Oh, I missed the UTF-8 decoder because you wrote "kind = 0": please, use PyUnicode_WCHAR_KIND instead! I don't like "reserved" value, especially if its value is 0, the first value. See Microsoft file formats: they waste a lot of space because most fields are reserved, and 10 years later, these fields are still unused. Can't we add the value 4 when we will need a new kind? > Also having one > designated value for "uninitialized" limits comparisons in the > affected functions to the kind value, otherwise they would need to > check the str field for NULL to determine in which buffer to write a > character. 
I have to read the code more carefully, I don't know this "uninitialized" state. For kind=0: "wstr" means that str is NULL but wstr is set? I didn't understand that str can be NULL for an initialized string. I should read the PEP again :-) >> I suppose that compilers prefer a switch with all cases defined, 0 a first item >> and contiguous values. We may need an enum. > > During the Summer of Code, Martin and I did a experiment with GCC and > it did not seem to produce a jump table as an optimization for three > cases but generated comparison instructions anyway. You mean with a switch with a case for each possible value? I don't think that GCC knows that all cases are defined if you don't use an enum. > I am not sure how much we should optimize for potential compiler > optimizations here. Oh, it was just a suggestion. Sure, it's not the best moment to care of micro-optimizations. Victor From cs at zip.com.au Wed Aug 24 10:54:48 2011 From: cs at zip.com.au (Cameron Simpson) Date: Wed, 24 Aug 2011 18:54:48 +1000 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: References: Message-ID: <20110824085448.GA10991@cskk.homeip.net> On 24Aug2011 12:31, Nick Coghlan wrote: | On Wed, Aug 24, 2011 at 5:19 AM, Steven D'Aprano wrote: | > Antoine Pitrou wrote: | >> When reviewing the PEP 3151 implementation (*), Ezio commented that | >> "FileSystemError" looks a bit strange and that "FilesystemError" would | >> be a better spelling. What is your opinion? | > | > It's a file system (two words), not filesystem (not in any dictionary or | > spell checker I've ever used). | | I rarely find spell checkers to be useful sources of data on correct | spelling of technical jargon (and the computing usage of the term | 'filesystem' definitely qualifies as jargon). | | > (Nor do we write filingsystem, governmentsystem, politicalsystem or | > schoolsystem. This is English, not German.) | | Personally, I think 'filesystem' is a portmanteau in the process of | coming into existence (as evidenced by usage like 'FHS' standing for | 'Filesystem Hierarchy Standard'). However, the two word form is still | useful at times, particularly for disambiguation of acronyms (as | evidenced by usage like 'NFS' and 'GFS' for 'Network File System' and | 'Google File System'). Funny, I thought NFS stood for Not a File System :-) | Since I tend to use the one word 'filesystem' form myself (ditto for | 'filename'), I'm +1 for FilesystemError, but I'm only -0 for | FileSystemError (so I expect that will be the option chosen, given | other responses). I also use "filesystem" as a one word piece of jargon, but I am persuaded by the language arguments. So I'm +1 for FileSystemError. Cheers, -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ Bolts get me through times of no courage better than courage gets me through times of no bolts! - Eric Hirst From v+python at g.nevcal.com Wed Aug 24 11:22:58 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 24 Aug 2011 02:22:58 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E54B3CC.9040900@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <4E54AEC8.7040702@g.nevcal.com> <4E54B3CC.9040900@v.loewis.de> Message-ID: <4E54C2F2.2090606@g.nevcal.com> On 8/24/2011 1:18 AM, "Martin v. 
L?wis" wrote: >> So am I correctly reading between the lines when, after reading this >> thread so far, and the complete issue discussion so far, that I see a >> PEP 393 revision or replacement that has the following characteristics: >> >> 1) Narrow builds are dropped. > PEP 393 already drops narrow builds. I'd forgotten that. > >> 2) There are more, or different, internal kinds of strings, which affect >> the processing patterns. > This is the basic idea of PEP 393. Agreed. > >> a) all ASCII >> b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This >> kind may not be able to support a "mostly" variation, and may be no more >> efficient than case b). But it might also be popular in parts of Europe > This two cases are already in PEP 393. Sure. Wanted to enumerate all, rather than just add-ons. >> c) mostly ASCII (utf8) with clever indexing/caching to be efficient >> d) UTF-8 with clever indexing/caching to be efficient > I see neither a need nor a means to consider these. The discussion about "mostly ASCII" strings seems convincing that there could be a significant space savings if such were implemented. >> e) 16-bit codepoints > These are in PEP 393. > >> f) UTF-16 with clever indexing/caching to be efficient > Again, -1. This is probably the one I would pick as least likely to be useful if the rest were implemented. >> g) 32-bit codepoints > This is in PEP 393. > >> h) UTF-32 > What's that, as opposed to g)? g) would permit codes greater than u+10ffff and would permit the illegal codepoints and lone surrogates. h) would be strict Unicode conformance. Sorry that the 4 paragraphs of explanation that you didn't quote didn't make that clear. > I'm not open to revise PEP 393 in the direction of adding more > representations. > It's your PEP. -------------- next part -------------- An HTML attachment was scrubbed... URL: From scott+python-dev at scottdial.com Wed Aug 24 11:25:18 2011 From: scott+python-dev at scottdial.com (Scott Dial) Date: Wed, 24 Aug 2011 05:25:18 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E54B246.5020008@haypocalc.com> References: <20110823001440.433a0f1f@pitrou.net> <201108240038.00801.victor.stinner@haypocalc.com> <4E54852E.9000601@scottdial.com> <4E54B246.5020008@haypocalc.com> Message-ID: <4E54C37E.6040305@scottdial.com> On 8/24/2011 4:11 AM, Victor Stinner wrote: > Le 24/08/2011 06:59, Scott Dial a ?crit : >> On 8/23/2011 6:38 PM, Victor Stinner wrote: >>> Le mardi 23 ao?t 2011 00:14:40, Antoine Pitrou a ?crit : >>>> - You could try to run stringbench, which can be found at >>>> http://svn.python.org/projects/sandbox/trunk/stringbench (*) >>>> and there's iobench (the text mode benchmarks) in the Tools/iobench >>>> directory. >>> >>> Some raw numbers. >>> >>> stringbench: >>> "147.07 203.07 72.4 TOTAL" for the PEP 393 >>> "146.81 140.39 104.6 TOTAL" for default >>> => PEP is 45% slower >> >> I ran the same benchmark and couldn't make a distinction in performance >> between them: > > Hum, are you sure that you used the PEP 383? Make sure that you are > using the pep-383 branch! I also started my benchmark on the wrong > branch :-) You are right. I used the "Get Source" link on bitbucket to save pulling the whole clone, but the "Get Source" link seems to be whatever branch has the lastest revision (maybe?) even if you switch branches on the webpage. 
To correct my previous post: cpython.txt 183.26 177.97 103.0 TOTAL cpython-wide-unicode.txt 181.27 195.58 92.7 TOTAL pep-393.txt 181.40 270.34 67.1 TOTAL And, cpython.txt real 0m32.493s cpython-wide-unicode.txt real 0m33.489s pep-393.txt real 0m36.206s -- Scott Dial scott at scottdial.com From martin at v.loewis.de Wed Aug 24 12:04:28 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Wed, 24 Aug 2011 12:04:28 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E54B3B6.1020205@haypocalc.com> References: <20110823001440.433a0f1f@pitrou.net> <201108240027.37788.victor.stinner@haypocalc.com> <4E54B3B6.1020205@haypocalc.com> Message-ID: <4E54CCAC.5080408@v.loewis.de> Am 24.08.2011 10:17, schrieb Victor Stinner: > Le 24/08/2011 04:41, Torsten Becker a ?crit : >> On Tue, Aug 23, 2011 at 18:27, Victor Stinner >> wrote: >>> I posted a patch to re-add it: >>> http://bugs.python.org/issue12819#msg142867 >> >> Thank you for the patch! Note that this patch adds the fast path only >> to the helper function which determines the length of the string and >> the maximum character. The decoding part is still without a fast path >> for ASCII runs. > > Ah? If utf8_max_char_size_and_has_errors() returns no error hand > maxchar=127: memcpy() is used. You mean that memcpy() is too slow? :-) No: the pure-ASCII case is already optimized with memcpy. It's the mostly-ASCII case that is not optimized anymore in this PEP 393 implementation (the one with "ASCII runs" instead of "pure ASCII"). Regards, Martin From tjreedy at udel.edu Wed Aug 24 12:06:39 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 24 Aug 2011 06:06:39 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote: > Terry Reedy writes: > > > The current UCS2 Unicode string implementation, by design, quickly gives > > WRONG answers for len(), iteration, indexing, and slicing if a string > > contains any non-BMP (surrogate pair) Unicode characters. That may have > > been excusable when there essentially were no such extended chars, and > > the few there were were almost never used. > > Well, no, it gives the right answer according to the design. unicode > objects do not contain character strings. Excuse me for believing the fine 3.2 manual that says "Strings contain Unicode characters." (And to a naive reader, that implies that string iteration and indexing should produce Unicode characters.) > By design, they contain code point strings. For the purpose of my sentence, the same thing in that code points correspond to characters, where 'character' includes ascii control 'characters' and unicode analogs. The problem is that on narrow builds strings are NOT code point sequences. They are 2-byte code *unit* sequences. Single non-BMP code points are seen as 2 code units and hence given a length of 2, not 1. Strings iterate, index, and slice by 2-byte code units, not by code points. Python floats try to follow the IEEE standard as interpreted for Python (Python has its software exceptions rather than signalling versus non-signalling hardware signals). 
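(To make the code-unit behaviour just described concrete, a short interactive
sketch -- U+10400 is an arbitrary non-BMP code point, and the surrogate values
shown are simply its two UTF-16 halves. On a 3.2 narrow build:

>>> s = '\U00010400'
>>> len(s)        # counts 16-bit code units, not code points
2
>>> s[0], s[1]    # indexing exposes the surrogate pair
('\ud801', '\udc00')

On a wide build, and under PEP 393, len(s) is 1 and s[0] is the full code
point.)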
Python decimals slavishly follow the IEEE decimal standard. Python narrow build unicode breaks the standard for non-BMP code points and cosequently, breaks the re module even when it works for wide builds. As sys.maxunicode more or less says, only the BMP subset is fully supported. Any narrow build string with even 1 non-BMP char violates the standard. > Guido has made that absolutely clear on a number > of occasions. It is not clear what you mean, but recently on python-ideas he has reiterated that he intends bytes and strings to be conceptually different. Bytes are computer-oriented binary arrays; strings are supposedly human-oriented character/codepoint arrays. Except they are not for non-BMP characters/codepoints. Narrow build unicode is effectively an array of two-byte binary units. > And the reasons have very little to do with lack of > non-BMP characters to trip up the implementation. Changing those > semantics should have been done before the release of Python 3. The documentation was changed at least a bit for 3.0, and anyway, as indicated above, it is easy (especially for new users) to read the docs in a way that makes the current behavior buggy. I agree that the implementation should have been changed already. Currently, the meaning of Python code differs on narrow versus wide build, and in a way that few users would expect or want. PEP 393 abolishes narrow builds as we now know them and changes semantics. I was answering a complaint about that change. If you do not like the PEP, fine. My separate proposal in my other post is for an alternative implementation but with, I presume, pretty the same visible changes. > It is not clear to me that it is a good idea to try to decide on "the" > correct implementation of Unicode strings in Python even today. If the implementation is invisible to the Python user, as I believe it should be without specially introspection, and mostly invisible in the C-API except for those who intentionally poke into the details, then the implementation can be changed as the consensus on best implementation changes. > There are a number of approaches that I can think of. > > 1. The "too bad if you can't take a joke" approach: do nothing and > recommend UTF-32 to those who want len() to DTRT. > 2. The "slope is slippery" approach: Implement UTF-16 objects as > built-ins, and then try to fend off requests for correct treatment > of unnormalized composed characters, normalization, compatibility > substitutions, bidi, etc etc. > 3. The "are we not hackers?" approach: Implement a transform that > maps characters that are not represented by a single code point > into Unicode private space, and then see if anybody really needs > more than 6400 non-BMP characters. (Note that this would > generalize to composed characters that don't have a one-code-point > NFC form and similar non-standardized cases that nonstandard users > might want handled.) > 4. The "42" approach: sadly, I can't think deeply enough to explain it. > > There are probably others. > > It's true that Python is going to need good libraries to provide > correct handling of Unicode strings (as opposed to unicode objects). Given that 3.0 unicode (string) objects are defined as Unicode character strings, I do not see the opposition. > But it's not clear to me given the wide variety of implementations I > can imagine that there will be one best implementation, let alone > which ones are good and Pythonic, and which not so. 
-- Terry Jan Reedy From martin at v.loewis.de Wed Aug 24 12:27:12 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Wed, 24 Aug 2011 12:27:12 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E54B5E9.3070905@haypocalc.com> References: <201108240046.16058.victor.stinner@haypocalc.com> <201108240056.48170.victor.stinner@haypocalc.com> <4E54B5E9.3070905@haypocalc.com> Message-ID: <4E54D200.5060200@v.loewis.de> >> I think the value for wstr/uninitialized/reserved should not be >> removed. The wstr representation is still used in the error case in >> the utf8 decoder because these strings can be resized. > > In Python, you can resize an object if it has only one reference. Why is > it not possible in your branch? If you use the new API to create a string (knowing how many characters you have, and what the maximum character is), the Unicode object is allocated as a single memory block. It can then not be resized. If you allocate in the old style (i.e. giving NULL as the data pointer, and a length), it still creates a second memory blocks for the Py_UNICODE[], and allows resizing. When you then call PyUnicode_Ready, the object gets frozen. > I don't like "reserved" value, especially if its value is 0, the first > value. See Microsoft file formats: they waste a lot of space because > most fields are reserved, and 10 years later, these fields are still > unused. Can't we add the value 4 when we will need a new kind? I don't get the analogy, or the relationship with the value 0. "Reserving" the value 0 is entirely different from reserving a field. In a field, it wastes space; the value 0 however fills the same space as the values 1,2,3. It's just used to denote an object where the str pointer is not filled out yet, i.e. which can still be resized. >>> I suppose that compilers prefer a switch with all cases defined, 0 a >>> first item >>> and contiguous values. We may need an enum. >> >> During the Summer of Code, Martin and I did a experiment with GCC and >> it did not seem to produce a jump table as an optimization for three >> cases but generated comparison instructions anyway. > > You mean with a switch with a case for each possible value? No, a computed jump on the assembler level. Consider this code enum kind {null,ucs1,ucs2,ucs4}; void foo(void *d, enum kind k, int i, int v) { switch(k){ case ucs1:((unsigned char*)d)[i] = v;break; case ucs2:((unsigned short*)d)[i] = v;break; case ucs4:((unsigned int*)d)[i] = v;break; } } gcc 4.6.1 compiles this to foo: .LFB0: .cfi_startproc cmpl $2, %esi je .L4 cmpl $3, %esi je .L5 cmpl $1, %esi je .L7 .p2align 4,,5 rep ret .p2align 4,,10 .p2align 3 .L7: movslq %edx, %rdx movb %cl, (%rdi,%rdx) ret .p2align 4,,10 .p2align 3 .L5: movslq %edx, %rdx movl %ecx, (%rdi,%rdx,4) ret .p2align 4,,10 .p2align 3 .L4: movslq %edx, %rdx movw %cx, (%rdi,%rdx,2) ret .cfi_endproc As you can see, it generates a chain of compares, rather than an indirect jump through a jump table. Regards, Martin From eliben at gmail.com Wed Aug 24 13:09:46 2011 From: eliben at gmail.com (Eli Bendersky) Date: Wed, 24 Aug 2011 14:09:46 +0300 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: <20110823202004.0bb63490@pitrou.net> References: <20110823202004.0bb63490@pitrou.net> Message-ID: > When reviewing the PEP 3151 implementation (*), Ezio commented that > "FileSystemError" looks a bit strange and that "FilesystemError" would > be a better spelling. What is your opinion? 
> > (*) http://bugs.python.org/issue12555 > +1 for FileSystemError Eli -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Wed Aug 24 14:50:34 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 24 Aug 2011 22:50:34 +1000 Subject: [Python-Dev] sendmsg/recvmsg on Mac OS X Message-ID: The buildbots are complaining about some of tests for the new socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that provide CMSG_LEN. http://www.python.org/dev/buildbot/all/builders/AMD64%20Snow%20Leopard%202%203.x/builds/831/steps/test/logs/stdio Before I start trying to figure this out without a Mac to test on, are any of the devs that actually use Mac OS X seeing the failure in their local builds? Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From ncoghlan at gmail.com Wed Aug 24 15:06:23 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 24 Aug 2011 23:06:23 +1000 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> Message-ID: On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote: > In utf16.py, attached to http://bugs.python.org/issue12729 > I propose for consideration a prototype of different solution to the 'mostly > BMP chars, few non-BMP chars' case. Rather than expand every character from > 2 bytes to 4, attach an array cpdex of character (ie code point, not code > unit) indexes. Then for indexing and slicing, the correction is simple, > simpler than I first expected: > ?code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) > where code-unit-index is the adjusted index into the full underlying > double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids > most of the space penalty and the consequent time penalty of moving more > bytes around and increasing cache misses. Interesting idea, but putting on my C programmer hat, I say -1. Non-uniform cell size = not a C array = standard C array manipulation idioms don't work = pain (no matter how simple the index correction happens to be). The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every character in the string. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From neologix at free.fr Wed Aug 24 15:31:50 2011 From: neologix at free.fr (=?ISO-8859-1?Q?Charles=2DFran=E7ois_Natali?=) Date: Wed, 24 Aug 2011 15:31:50 +0200 Subject: [Python-Dev] sendmsg/recvmsg on Mac OS X In-Reply-To: References: Message-ID: > The buildbots are complaining about some of tests for the new > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that > provide CMSG_LEN. Looks like kernel bugs: http://developer.apple.com/library/mac/#qa/qa1541/_index.html """ Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor passing [...] Avoid passing two or more descriptors back-to-back. """ We should probably add @requires_mac_ver(10, 5) for testFDPassSeparate and testFDPassSeparateMinSpace. As for InterruptedSendTimeoutTest and testInterruptedSendmsgTimeout, it also looks like a kernel bug: the syscall should fail with EINTR once the socket buffer is full. I guess one should skip those on OS-X. 
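For concreteness, here is a rough sketch of what such skips could look like in
Lib/test/test_socket.py. The method names are taken from the message above;
the class name and the exact decorator choices are only illustrative, not a
proposed patch (support.requires_mac_ver skips when running on an older
release, unittest.skipIf skips unconditionally on Darwin):

import sys
import unittest
from test import support

class FDPassingSkipExample(unittest.TestCase):

    # Descriptor-passing tests affected by the bugs described in Apple QA1541.
    @support.requires_mac_ver(10, 5)
    def testFDPassSeparate(self):
        pass  # real body lives in test_socket.py

    # sendmsg() never failing with EINTR looks like a kernel issue, so a
    # blanket platform skip may be the only option for this one.
    @unittest.skipIf(sys.platform == 'darwin',
                     'OS X kernel bug: no EINTR on interrupted sendmsg()')
    def testInterruptedSendmsgTimeout(self):
        pass  # real body lives in test_socket.py

if __name__ == '__main__':
    unittest.main()

Whether requires_mac_ver(10, 5) would actually be enough is exactly what is in
question here, since the failures above are on 10.6.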
From stefan_ml at behnel.de Wed Aug 24 18:00:42 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 24 Aug 2011 18:00:42 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> Message-ID: Nick Coghlan, 24.08.2011 15:06: > On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote: >> In utf16.py, attached to http://bugs.python.org/issue12729 >> I propose for consideration a prototype of different solution to the 'mostly >> BMP chars, few non-BMP chars' case. Rather than expand every character from >> 2 bytes to 4, attach an array cpdex of character (ie code point, not code >> unit) indexes. Then for indexing and slicing, the correction is simple, >> simpler than I first expected: >> code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) >> where code-unit-index is the adjusted index into the full underlying >> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids >> most of the space penalty and the consequent time penalty of moving more >> bytes around and increasing cache misses. > > Interesting idea, but putting on my C programmer hat, I say -1. > > Non-uniform cell size = not a C array = standard C array manipulation > idioms don't work = pain (no matter how simple the index correction > happens to be). > > The nice thing about PEP 383 is that it gives us the smallest storage > array that is both an ordinary C array and has sufficiently large > individual elements to handle every character in the string. +1 Stefan From stephen at xemacs.org Wed Aug 24 18:34:17 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 25 Aug 2011 01:34:17 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> Terry Reedy writes: > Excuse me for believing the fine 3.2 manual that says > "Strings contain Unicode characters." The manual is wrong, then, subject to a pronouncement to the contrary, of course. I was on your side of the fence when this was discussed, pre-release. I was wrong then. My bet is that we are still wrong, now. > For the purpose of my sentence, the same thing in that code points > correspond to characters, Not in Unicode, they do not. By definition, a small number of code points (eg, U+FFFF) *never* did and *never* will correspond to characters. Since about Unicode 3.0, the same is true of surrogate code points. Some restrictions have been placed on what can be done with composed characters, so even with the PEP (which gives us code point arrays) we do not really get arrays of Unicode characters that fully conform to the model. > strings are NOT code point sequences. They are 2-byte code *unit* > sequences. I stand corrected on Unicode terminology. "Code unit" is what I meant, and what I understand Guido to have defined unicode objects as arrays of. > Any narrow build string with even 1 non-BMP char violates the > standard. Yup. That's by design. > > Guido has made that absolutely clear on a number > > of occasions. 
> > It is not clear what you mean, but recently on python-ideas he has > reiterated that he intends bytes and strings to be conceptually > different. Sure. Nevertheless, practicality beat purity long ago, and that decision has never been rescinded AFAIK. > Bytes are computer-oriented binary arrays; strings are > supposedly human-oriented character/codepoint arrays. And indeed they are, in UCS-4 builds. But they are *not* in Unicode! Unicode violates the array model. Specifically, in handling composing characters, and in bidi, where arbitrary slicing of direction control characters will result in garbled display. The thing is, that 90% of applications are not really going to care about full conformance to the Unicode standard. Of the remaining 10%, 90% are not going to need both huge strings *and* ABI interoperability with C modules compiled for UCS-2, so UCS-4 is satisfactory. Of the remaining 1% of all applications, those that deal with huge strings *and* need full Unicode conformance, well, they need efficiency too almost by definition. They probably are going to want something more efficient than either the UTF-16 or the UTF-32 representation can provide, and therefore will need trickier, possibly app-specific, algorithms that probably do not belong in an initial implementation. > > And the reasons have very little to do with lack of > > non-BMP characters to trip up the implementation. Changing those > > semantics should have been done before the release of Python 3. > > The documentation was changed at least a bit for 3.0, and anyway, as > indicated above, it is easy (especially for new users) to read the docs > in a way that makes the current behavior buggy. I agree that the > implementation should have been changed already. I don't. I suspect Guido does not, even today. > Currently, the meaning of Python code differs on narrow versus wide > build, and in a way that few users would expect or want. Let them become developers, then, and show us how to do it better. > PEP 393 abolishes narrow builds as we now know them and changes > semantics. I was answering a complaint about that change. If you do > not like the PEP, fine. No, I do like the PEP. However, it is only a step, a rather conservative one in some ways, toward conformance to the Unicode character model. In particular, it does nothing to resolve the fact that len() will give different answers for character count depending on normalization, and that slicing and indexing will allow you to cut characters in half (even in NFC, since not all composed characters have fully composed forms). > > It is not clear to me that it is a good idea to try to decide on "the" > > correct implementation of Unicode strings in Python even today. > > If the implementation is invisible to the Python user, as I believe it > should be without specially introspection, and mostly invisible in the > C-API except for those who intentionally poke into the details, then the > implementation can be changed as the consensus on best implementation > changes. A naive implementation of UTF-16 will be quite visible in terms of performance, I suspect, and performance-oriented applications will "go behind the API's back" to get it. We're already seeing that in the people who insist that bytes are characters too, and string APIs should work on them just as they do on (Unicode) strings. > > It's true that Python is going to need good libraries to provide > > correct handling of Unicode strings (as opposed to unicode objects). 
> > Given that 3.0 unicode (string) objects are defined as Unicode character > strings, I do not see the opposition. I think they're not, I think they're defined as Unicode code unit arrays, and that the documentation is in error. If the documentation is correct, then Python 3.0 was released about 5 years too early, because correct handling of those objects as arrays of Unicode characters has never been implemented or even discussed in terms of proposed code that I know of. Martin has long claimed that the fact that I/O is done in terms of UTF-16 means that the internal representation is UTF-16, so I could be wrong. But when issues of slicing, len() values and so on have come up in the past, Guido has always said "no, there will be no change in semantics of builtins here". From solipsis at pitrou.net Wed Aug 24 18:38:46 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 24 Aug 2011 18:38:46 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <20110824183846.2b392f77@pitrou.net> On Thu, 25 Aug 2011 01:34:17 +0900 "Stephen J. Turnbull" wrote: > > Martin has long claimed that the fact that I/O is done in terms of > UTF-16 means that the internal representation is UTF-16 Which I/O? From solipsis at pitrou.net Wed Aug 24 18:49:27 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 24 Aug 2011 18:49:27 +0200 Subject: [Python-Dev] sendmsg/recvmsg on Mac OS X References: Message-ID: <20110824184927.2697b0af@pitrou.net> On Wed, 24 Aug 2011 15:31:50 +0200 Charles-Fran?ois Natali wrote: > > The buildbots are complaining about some of tests for the new > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that > > provide CMSG_LEN. > > Looks like kernel bugs: > http://developer.apple.com/library/mac/#qa/qa1541/_index.html > > """ > Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor passing > [...] > Avoid passing two or more descriptors back-to-back. > """ But Snow Leopard, where these failures occur, is OS X 10.6. Antoine. From riscutiavlad at gmail.com Wed Aug 24 18:57:56 2011 From: riscutiavlad at gmail.com (Vlad Riscutia) Date: Wed, 24 Aug 2011 09:57:56 -0700 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: References: <20110823202004.0bb63490@pitrou.net> Message-ID: +1 for FileSystemError. I see myself misspelling it as FileSystemError if we go with alternate spelling. I'll probably won't be the only one. Thank you, Vlad On Wed, Aug 24, 2011 at 4:09 AM, Eli Bendersky wrote: > > When reviewing the PEP 3151 implementation (*), Ezio commented that >> "FileSystemError" looks a bit strange and that "FilesystemError" would >> be a better spelling. What is your opinion? >> >> (*) http://bugs.python.org/issue12555 >> > > +1 for FileSystemError > > Eli > > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/riscutiavlad%40gmail.com > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stephen at xemacs.org Wed Aug 24 19:15:48 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 25 Aug 2011 02:15:48 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <20110824183846.2b392f77@pitrou.net> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <20110824183846.2b392f77@pitrou.net> Message-ID: <87mxeyzq63.fsf@uwakimon.sk.tsukuba.ac.jp> Antoine Pitrou writes: > On Thu, 25 Aug 2011 01:34:17 +0900 > "Stephen J. Turnbull" wrote: > > > > Martin has long claimed that the fact that I/O is done in terms of > > UTF-16 means that the internal representation is UTF-16 > > Which I/O? Eg, display of characters in the interpreter. From solipsis at pitrou.net Wed Aug 24 19:16:29 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 24 Aug 2011 19:16:29 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87mxeyzq63.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <20110824183846.2b392f77@pitrou.net> <87mxeyzq63.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <1314206189.3549.2.camel@localhost.localdomain> Le jeudi 25 ao?t 2011 ? 02:15 +0900, Stephen J. Turnbull a ?crit : > Antoine Pitrou writes: > > On Thu, 25 Aug 2011 01:34:17 +0900 > > "Stephen J. Turnbull" wrote: > > > > > > Martin has long claimed that the fact that I/O is done in terms of > > > UTF-16 means that the internal representation is UTF-16 > > > > Which I/O? > > Eg, display of characters in the interpreter. I don't know why you say it's "done in terms of UTF-16", then. Unicode strings are simply encoded to whatever character set is detected as the terminal's character set. Regards Antoine. From victor.stinner at haypocalc.com Wed Aug 24 19:45:27 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 19:45:27 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> Message-ID: <4E5538B7.8010709@haypocalc.com> Le 24/08/2011 02:46, Terry Reedy a ?crit : > On 8/23/2011 9:21 AM, Victor Stinner wrote: >> Le 23/08/2011 15:06, "Martin v. L?wis" a ?crit : >>> Well, things have to be done in order: >>> 1. the PEP needs to be approved >>> 2. the performance bottlenecks need to be identified >>> 3. optimizations should be applied. >> >> I would not vote for the PEP if it slows down Python, especially if it's >> much slower. But Torsten says that it speeds up Python, which is >> surprising. 
I have to do my own benchmarks :-) > > The current UCS2 Unicode string implementation, by design, quickly gives > WRONG answers for len(), iteration, indexing, and slicing if a string > contains any non-BMP (surrogate pair) Unicode characters. That may have > been excusable when there essentially were no such extended chars, and > the few there were were almost never used. But now there are many more, > with more being added to each Unicode edition. They include cursive Math > letters that are used in English documents today. The problem will > slowly get worse and Python, at least on Windows, will become a language > to avoid for dependable Unicode document processing. 3.x needs a proper > Unicode implementation that works for all strings on all builds. I don't think that using UTF-16 with surrogate pairs is really a big problem. A lot of work has been done to hide this. For example, repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters. Ezio fixed recently str.is*() methods in Python 3.2+. For len(str): its a known problem, but if you really care of the number of *character* and not the number of UTF-16 units, it's easy to implement your own character_length() function. len(str) gives the UTF-16 units instead of the number of character for a simple reason: it's faster: O(1), whereas character_length() is O(n). > utf16.py, attached to http://bugs.python.org/issue12729 > prototypes a different solution than the PEP for the above problems for > the 'mostly BMP' case. I will discuss it in a different post. Yeah, you can workaround UTF-16 limits using O(n) algorithms. PEP-393 provides support of the full Unicode charset (U+0000-U+10FFFF) an all platforms with a small memory footprint and only O(1) functions. Note: Java and the Qt library use also UTF-16 strings and have exactly the same "limitations" for str[n] and len(str). Victor From martin at v.loewis.de Wed Aug 24 19:50:13 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 24 Aug 2011 19:50:13 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E5539D5.60500@v.loewis.de> > > PEP 393 abolishes narrow builds as we now know them and changes > > semantics. I was answering a complaint about that change. If you do > > not like the PEP, fine. > > No, I do like the PEP. However, it is only a step, a rather > conservative one in some ways, toward conformance to the Unicode > character model. I'd like to point out that the improved compatibility is only a side effect, not the primary objective of the PEP. The primary objective is the reduction in memory usage. (any changes in runtime are also side effects, and it's not really clear yet whether you get speedups or slowdowns on average, or no effect). > > Given that 3.0 unicode (string) objects are defined as Unicode character > > strings, I do not see the opposition. > > I think they're not, I think they're defined as Unicode code unit > arrays, and that the documentation is in error. That's just a description of the implementation, and not part of the language, though. 
My understanding is that the "abstract Python language definition" considers this aspect implementation-defined: PyPy, Jython, IronPython etc. would be free to do things differently (and I understand that there are plans to do PEP-393 style Unicode objects in PyPy). > Martin has long claimed that the fact that I/O is done in terms of > UTF-16 means that the internal representation is UTF-16, so I could be > wrong. But when issues of slicing, len() values and so on have come > up in the past, Guido has always said "no, there will be no change in > semantics of builtins here". Not with these words, though. As I recall, it's rather like (still with different words) "len() will stay O(1) forever, regardless of any perceived incorrectness of this choice". An attempt to change the builtins to introduce higher complexity for the sake of correctness is what he rejects. I think PEP 393 balances this well, keeping the O(1) operations in that complexity, while improving the cross- platform "correctness" of these functions. Regards, Martin From martin at v.loewis.de Wed Aug 24 19:54:06 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Wed, 24 Aug 2011 19:54:06 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <1314206189.3549.2.camel@localhost.localdomain> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <20110824183846.2b392f77@pitrou.net> <87mxeyzq63.fsf@uwakimon.sk.tsukuba.ac.jp> <1314206189.3549.2.camel@localhost.localdomain> Message-ID: <4E553ABE.7@v.loewis.de> >> Eg, display of characters in the interpreter. > > I don't know why you say it's "done in terms of UTF-16", then. Unicode > strings are simply encoded to whatever character set is detected as the > terminal's character set. I think what he means (and what I meant when I said something similar): I/O will consider surrogate pairs in the representation when converting to the output encoding. This is actually relevant only for UTF-8 (I think), which converts surrogate pairs "correctly". This can be taken as a proof that Python 3.2 is "UTF-16 aware" (in some places, but not in others). With Python's I/O architecture, it is of course not *actually* the I/O which considers UTF-16, but the codec. Regards, Martin From victor.stinner at haypocalc.com Wed Aug 24 20:00:45 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 20:00:45 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E54C2F2.2090606@g.nevcal.com> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <4E54AEC8.7040702@g.nevcal.com> <4E54B3CC.9040900@v.loewis.de> <4E54C2F2.2090606@g.nevcal.com> Message-ID: <4E553C4D.4060104@haypocalc.com> Le 24/08/2011 11:22, Glenn Linderman a ?crit : >>> c) mostly ASCII (utf8) with clever indexing/caching to be efficient >>> d) UTF-8 with clever indexing/caching to be efficient >> I see neither a need nor a means to consider these. > > The discussion about "mostly ASCII" strings seems convincing that there > could be a significant space savings if such were implemented. Antoine's optimization in the UTF-8 decoder has been removed. 
It doesn't change the memory footprint, it is just slower to create the Unicode object. When you decode an UTF-8 string: - "abc" string uses "latin1" (8 bits) units - "a?" string uses "latin1" (8 bits) units <= cool! - "a?" string uses UCS2 (16 bits) units - "a\U0010FFFF" string uses UCS4 (32 bits) units Victor From martin at v.loewis.de Wed Aug 24 20:15:24 2011 From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 24 Aug 2011 20:15:24 +0200 Subject: [Python-Dev] PEP 393 review Message-ID: <4E553FBC.7080501@v.loewis.de> Guido has agreed to eventually pronounce on PEP 393. Before that can happen, I'd like to collect feedback on it. There have been a number of voice supporting the PEP in principle, so I'm now interested in comments in the following areas: - principle objection. I'll list them in the PEP. - issues to be considered (unclarities, bugs, limitations, ...) - conditions you would like to pose on the implementation before acceptance. I'll see which of these can be resolved, and list the ones that remain open. Regards, Martin From solipsis at pitrou.net Wed Aug 24 20:32:28 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 24 Aug 2011 20:32:28 +0200 Subject: [Python-Dev] PEP 393 review References: <4E553FBC.7080501@v.loewis.de> Message-ID: <20110824203228.3e00874d@pitrou.net> On Wed, 24 Aug 2011 20:15:24 +0200 "Martin v. L?wis" wrote: > - issues to be considered (unclarities, bugs, limitations, ...) With this PEP, the unicode object overhead grows to 10 pointer-sized words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. Does it have any adverse effects? Are there any plans to make instantiation of small strings fast enough? Or is it already as fast as it should be? When interfacing with the Win32 "wide" APIs, what is the recommended way to get the required LPCWSTR? Will the format codes returning a Py_UNICODE pointer with PyArg_ParseTuple be deprecated? Do you think the wstr representation could be removed in some future version of Python? Is PyUnicode_Ready() necessary for all unicode objects, or only those allocated through the legacy API? ?The Py_Unicode representation is not instantaneously available?: you mean the Py_UNICODE representation? > - conditions you would like to pose on the implementation before > acceptance. I'll see which of these can be resolved, and list > the ones that remain open. That it doesn't significantly slow down benchmarks such as stringbench and iobench. Regards Antoine. From nad at acm.org Wed Aug 24 20:37:20 2011 From: nad at acm.org (Ned Deily) Date: Wed, 24 Aug 2011 11:37:20 -0700 Subject: [Python-Dev] sendmsg/recvmsg on Mac OS X References: <20110824184927.2697b0af@pitrou.net> Message-ID: In article <20110824184927.2697b0af at pitrou.net>, Antoine Pitrou wrote: > On Wed, 24 Aug 2011 15:31:50 +0200 > Charles-Fran?ois Natali wrote: > > > The buildbots are complaining about some of tests for the new > > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that > > > provide CMSG_LEN. > > > > Looks like kernel bugs: > > http://developer.apple.com/library/mac/#qa/qa1541/_index.html > > > > """ > > Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor > > passing > > [...] > > Avoid passing two or more descriptors back-to-back. > > """ > > But Snow Leopard, where these failures occur, is OS X 10.6. But chances are the build is using the default 10.4 ABI. Adding MACOSX_DEPLOYMENT_TARGET=10.6 as an env variable to ./configure may fix it. 
There is an open issue to change configure to use better defaults for this. (I'm right in the middle of reconfiguring my development systems so I can't test it myself immediately but I'll report back shortly.) -- Ned Deily, nad at acm.org From tjreedy at udel.edu Wed Aug 24 20:46:21 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 24 Aug 2011 14:46:21 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E5539D5.60500@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <4E5539D5.60500@v.loewis.de> Message-ID: On 8/24/2011 1:50 PM, "Martin v. L?wis" wrote: > I'd like to point out that the improved compatibility is only a side > effect, not the primary objective of the PEP. Then why does the Rationale start with "on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported."? A Windows user can only solve this problem by switching to *nix. > The primary objective is the reduction in memory usage. On average (perhaps). As I understand the PEP, for some strings, Windows users will see a doubling of memory usage. Statistically, that doubling is probably more likely in longer texts. Ascii-only Python code and other limited-to-ascii text will benefit. Typical English business documents will see no change as they often have proper non-ascii quotes and occasional accented characters, trademark symbols, and other things. I think you have the objectives backwards. Adding memory is a lot easier than switching OSes. -- Terry Jan Reedy From solipsis at pitrou.net Wed Aug 24 20:50:47 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 24 Aug 2011 20:50:47 +0200 Subject: [Python-Dev] sendmsg/recvmsg on Mac OS X References: <20110824184927.2697b0af@pitrou.net> Message-ID: <20110824205047.6be49525@pitrou.net> On Wed, 24 Aug 2011 11:37:20 -0700 Ned Deily wrote: > In article <20110824184927.2697b0af at pitrou.net>, > Antoine Pitrou wrote: > > On Wed, 24 Aug 2011 15:31:50 +0200 > > Charles-Fran?ois Natali wrote: > > > > The buildbots are complaining about some of tests for the new > > > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that > > > > provide CMSG_LEN. > > > > > > Looks like kernel bugs: > > > http://developer.apple.com/library/mac/#qa/qa1541/_index.html > > > > > > """ > > > Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor > > > passing > > > [...] > > > Avoid passing two or more descriptors back-to-back. > > > """ > > > > But Snow Leopard, where these failures occur, is OS X 10.6. > > But chances are the build is using the default 10.4 ABI. Adding > MACOSX_DEPLOYMENT_TARGET=10.6 as an env variable to ./configure may fix > it. Does the ABI affect kernel bugs? Regards Antoine. 
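Whether or not the ABI matters for this particular bug, one quick way to check
from the interpreter which deployment target a given OS X build was configured
with is a couple of standard-library calls (a small sketch; the values in the
comments are only examples):

import platform
import sysconfig

print(platform.mac_ver()[0])
# the running OS release, e.g. '10.6.8'
print(sysconfig.get_config_var('MACOSX_DEPLOYMENT_TARGET'))
# the target the build was configured for, e.g. '10.4'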
From v+python at g.nevcal.com Wed Aug 24 20:52:51 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 24 Aug 2011 11:52:51 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> Message-ID: <4E554883.5020908@g.nevcal.com> On 8/24/2011 9:00 AM, Stefan Behnel wrote: > Nick Coghlan, 24.08.2011 15:06: >> On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote: >>> In utf16.py, attached to http://bugs.python.org/issue12729 >>> I propose for consideration a prototype of different solution to the >>> 'mostly >>> BMP chars, few non-BMP chars' case. Rather than expand every >>> character from >>> 2 bytes to 4, attach an array cpdex of character (ie code point, not >>> code >>> unit) indexes. Then for indexing and slicing, the correction is simple, >>> simpler than I first expected: >>> code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) >>> where code-unit-index is the adjusted index into the full underlying >>> double-byte array. This adds a time penalty of log2(len(cpdex)), but >>> avoids >>> most of the space penalty and the consequent time penalty of moving >>> more >>> bytes around and increasing cache misses. >> >> Interesting idea, but putting on my C programmer hat, I say -1. >> >> Non-uniform cell size = not a C array = standard C array manipulation >> idioms don't work = pain (no matter how simple the index correction >> happens to be). >> >> The nice thing about PEP 383 is that it gives us the smallest storage >> array that is both an ordinary C array and has sufficiently large >> individual elements to handle every character in the string. > > +1 Yes, this sounds like a nice benefit, but the problem is it is false. The correct statement would be: The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every Unicode codepoint in the string. As Tom eloquently describes in the referenced issue (is Tom ever non-eloquent?), not all characters can be represented in a single codepoint. It seems there are three concepts in Unicode, code units, codepoints, and characters, none of which are equivalent (and the first of which varies according to the encoding). It also seems (to me) that Unicode has failed in its original premise, of being an easy way to handle "big char" for "all languages" with fixed size elements, but it is not clear that its original premise is achievable regardless of the size of "big char", when mixed directionality is desired, and it seems that support of some single languages require mixed directionality, not to mention mixed language support. Given the required variability of character size in all presently Unicode defined encodings, I tend to agree with Tom that UTF-8, together with some technique of translating character index to code unit offset, may provide the best overall space utilization, and adequate CPU efficiency. On the other hand, there are large subsets of applications that simply do not require support for bidirectional text or composed characters, and for those that do not, it remains to be seen if the price to be paid for supporting those features is too high a price for such applications. So far, we don't have implementations to benchmark to figure that out! What does this mean for Python? 
Well, if Python is willing to limit its support for applications to the subset for which the "big char" solution sufficient, then PEP 393 provides a way to do that, that looks to be pretty effective for reducing memory consumption for those applications that use short strings most of which can be classified by content into the 1 byte or 2 byte representations. Applications that support long strings are more likely to bitten by the occasional "outlier" character that is longer than the average character, doubling or quadrupling the space needed to represent such strings, and eliminating a significant portion of the space savings the PEP is providing for other applications. Benchmarks may or may not fully reflect the actual requirements of all applications, so conclusions based on benchmarking can easily be blind-sided the realities of other applications, unless the benchmarks are carefully constructed. It is possible that the ideas in PEP 393, with its support for multiple underlying representations, could be the basis for some more complex representations that would better support characters rather than only supporting code points, but Martin has stated he is not open to additional representations, so the PEP itself cannot be that basis (although with care which may or may not be taken in the implementation of the PEP, the implementation may still provide that basis). -------------- next part -------------- An HTML attachment was scrubbed... URL: From cf.natali at gmail.com Wed Aug 24 21:02:59 2011 From: cf.natali at gmail.com (=?ISO-8859-1?Q?Charles=2DFran=E7ois_Natali?=) Date: Wed, 24 Aug 2011 21:02:59 +0200 Subject: [Python-Dev] sendmsg/recvmsg on Mac OS X In-Reply-To: <20110824184927.2697b0af@pitrou.net> References: <20110824184927.2697b0af@pitrou.net> Message-ID: > But Snow Leopard, where these failures occur, is OS X 10.6. *sighs* It still looks like a kernel/libc bug to me: AFAICT, both the code and the tests are correct. And apparently, there are still issues pertaining to FD passing on 10.5 (and maybe later, I couldn't find a public access to their bug tracker): http://lists.apple.com/archives/Darwin-dev/2008/Feb/msg00033.html Anyway, if someone with a recent OS X release could run test_socket, it would probably help. Follow ups to http://bugs.python.org/issue6560 From guido at python.org Wed Aug 24 21:34:34 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 24 Aug 2011 12:34:34 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E554883.5020908@g.nevcal.com> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <4E554883.5020908@g.nevcal.com> Message-ID: On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman wrote: > On 8/24/2011 9:00 AM, Stefan Behnel wrote: > > Nick Coghlan, 24.08.2011 15:06: > > On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote: > > In utf16.py, attached to http://bugs.python.org/issue12729 > I propose for consideration a prototype of different solution to the 'mostly > BMP chars, few non-BMP chars' case. Rather than expand every character from > 2 bytes to 4, attach an array cpdex of character (ie code point, not code > unit) indexes. Then for indexing and slicing, the correction is simple, > simpler than I first expected: > ? code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) > where code-unit-index is the adjusted index into the full underlying > double-byte array. 
This adds a time penalty of log2(len(cpdex)), but avoids > most of the space penalty and the consequent time penalty of moving more > bytes around and increasing cache misses. > > Interesting idea, but putting on my C programmer hat, I say -1. > > Non-uniform cell size = not a C array = standard C array manipulation > idioms don't work = pain (no matter how simple the index correction > happens to be). > > The nice thing about PEP 383 is that it gives us the smallest storage > array that is both an ordinary C array and has sufficiently large > individual elements to handle every character in the string. > > +1 > > Yes, this sounds like a nice benefit, but the problem is it is false.? The > correct statement would be: > > The nice thing about PEP 383 is that it gives us the smallest storage > array that is both an ordinary C array and has sufficiently large > individual elements to handle every Unicode codepoint in the string. (PEP 393, I presume. :-) > As Tom eloquently describes in the referenced issue (is Tom ever > non-eloquent?), not all characters can be represented in a single codepoint. But this is also besides the point (except insofar where we have to remind ourselves not to confuse the two in docs). > It seems there are three concepts in Unicode, code units, codepoints, and > characters, none of which are equivalent (and the first of which varies > according to the encoding). It also seems (to me) that Unicode has failed > in its original premise, of being an easy way to handle "big char" for "all > languages" with fixed size elements, but it is not clear that its original > premise is achievable regardless of the size of "big char", when mixed > directionality is desired, and it seems that support of some single > languages require mixed directionality, not to mention mixed language > support. I see nothing wrong with having the language's fundamental data types (i.e., the unicode object, and even the re module) to be defined in terms of codepoints, not characters, and I see nothing wrong with len() returning the number of codepoints (as long as it is advertised as such). After all UTF-8 also defines an encoding for a sequence of code points. Characters that require two or more codepoints are not represented special in UTF-8 -- they are represented as two or more encoded codepoints. The added requirement that UTF-8 must only be used to represent valid characters is just that -- it doesn't affect how strings are encoded, just what is considered valid at a higher level. > Given the required variability of character size in all presently Unicode > defined encodings, I tend to agree with Tom that UTF-8, together with some > technique of translating character index to code unit offset, may provide > the best overall space utilization, and adequate CPU efficiency. There is no doubt that UTF-8 is the most space efficient. I just don't think it is worth giving up O(1) indexing of codepoints -- it would change programmers' expectations too much. OTOH I am sold on getting rid of the added complexities of "narrow builds" where not even all codepoints can be represented without using surrogate pairs (i.e. two code units per codepoint) and indexing uses code units instead of codepoints. I think this is an area where PEP 393 has a huge advantage: users can get rid of their exceptions for narrow builds. 
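To make the codepoint/character distinction concrete, a short interactive
sketch (nothing here is specific to PEP 393; it behaves the same on any
build):

>>> import unicodedata
>>> s1 = 'caf\u00e9'     # U+00E9, one precomposed code point
>>> s2 = 'cafe\u0301'    # 'e' followed by U+0301 COMBINING ACUTE ACCENT
>>> len(s1), len(s2)     # len() counts code points, not characters
(4, 5)
>>> unicodedata.normalize('NFC', s2) == s1
True
>>> sum(1 for c in s2 if not unicodedata.combining(c))   # rough character count
4

So even with full code-point indexing, "how many characters" still depends on
normalization, which is the part the stdlib can reasonably leave to user code
or libraries.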
> On the > other hand, there are large subsets of applications that simply do not > require support for bidirectional text or composed characters, and for those > that do not, it remains to be seen if the price to be paid for supporting > those features is too high a price for such applications. So far, we don't > have implementations to benchmark to figure that out! I think you are saying that many apps can ignore the distinction between codepoints and characters. Given the complexity of bidi rendering and normalization (which will always remain an issue) I agree; this is much less likely to be a burden than the narrow-build issues with code units vs. codepoints. What should the stdlib do? It should try to skirt the issue where it can (using the garbage-in-garbage-out principle) and advertise what it supports where there is a difference. I don't see why all the stdlib should be made aware of multi-codepoint-characters and other bidi requirements, but it should be clear to the user who has such requirements which stdlib operations they can safely use. > What does this mean for Python?? Well, if Python is willing to limit its > support for applications to the subset for which the "big char" solution > sufficient, then PEP 393 provides a way to do that, that looks to be pretty > effective for reducing memory consumption for those applications that use > short strings most of which can be classified by content into the 1 byte or > 2 byte representations.? Applications that support long strings are more > likely to bitten by the occasional "outlier" character that is longer than > the average character, doubling or quadrupling the space needed to represent > such strings, and eliminating a significant portion of the space savings the > PEP is providing for other applications. This seems more of an intuition than a fact. I could easily imagine the facts being that even for large strings, usually either there are no outliers, or there is a significant number of outliers. (E.g. Tom Christiansen's OSCON preso falls in the latter category :-). As long as it *works* I don't really mind that there are some extreme cases that are slow. You'll always have that. > Benchmarks may or may not fully > reflect the actual requirements of all applications, so conclusions based on > benchmarking can easily be blind-sided the realities of other applications, > unless the benchmarks are carefully constructed. Yeah, it's a learning process. > It is possible that the ideas in PEP 393, with its support for multiple > underlying representations, could be the basis for some more complex > representations that would better support characters rather than only > supporting code points, but Martin has stated he is not open to additional > representations, so the PEP itself cannot be that basis (although with care > which may or may not be taken in the implementation of the PEP, the > implementation may still provide that basis). There is always the possibility of representations that are defined purely by userland code and can only be manipulated by that specific code. But expecting C extensions to support new representations that haven't been defined yet sounds like a bad idea. 
-- --Guido van Rossum (python.org/~guido) From tjreedy at udel.edu Wed Aug 24 21:55:09 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 24 Aug 2011 15:55:09 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote: > Terry Reedy writes: > > > Excuse me for believing the fine 3.2 manual that says > > "Strings contain Unicode characters." > > The manual is wrong, then, subject to a pronouncement to the contrary, Please suggest a re-wording then, as it is a bug for doc and behavior to disagree. > > For the purpose of my sentence, the same thing in that code points > > correspond to characters, > > Not in Unicode, they do not. By definition, a small number of code > points (eg, U+FFFF) *never* did and *never* will correspond to > characters. On computers, characters are represented by code points. What about the other way around? http://www.unicode.org/glossary/#C says code point: 1) i in range(0x11000) 2) "A value, or position, for a character" (To muddy the waters more, 'character' has multiple definitions also.) You are using 1), I am using 2) ;-(. > > Any narrow build string with even 1 non-BMP char violates the > > standard. > > Yup. That's by design. [...] > Sure. Nevertheless, practicality beat purity long ago, and that > decision has never been rescinded AFAIK. I think you have it backwards. I see the current situation as the purity of the C code beating the practicality for the user of getting right answers. > The thing is, that 90% of applications are not really going to care > about full conformance to the Unicode standard. I remember when Intel argued that 99% of applications were not going to be affected when the math coprocessor in its then new chips occasionally gave 'non-standard' answers with certain divisors. > > Currently, the meaning of Python code differs on narrow versus wide > > build, and in a way that few users would expect or want. > > Let them become developers, then, and show us how to do it better. I posted a proposal with a link to a prototype implementation in Python. It pretty well solves the problem of narrow builds acting different from wide builds with respect to the basic operations of len(), iterations, indexing, and slicing. > No, I do like the PEP. However, it is only a step, a rather > conservative one in some ways, toward conformance to the Unicode > character model. In particular, it does nothing to resolve the fact > that len() will give different answers for character count depending > on normalization, and that slicing and indexing will allow you to cut > characters in half (even in NFC, since not all composed characters > have fully composed forms). I believe my scheme could be extended to solve that also. It would require more pre-processing and more knowledge than I currently have of normalization. I have the impression that the grapheme problem goes further than just normalization. 
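For readers who have not followed the issue, the index-correction idea is roughly the following (a reconstruction for illustration only; build_cpdex and code_unit_index are names made up here, not the actual utf16.py code):

import bisect

def build_cpdex(s):
    # Character indexes of the non-BMP characters in a narrow-build
    # string, collected in one pass over the UTF-16 code units.
    cpdex = []
    for i, u in enumerate(s):
        if 0xD800 <= ord(u) <= 0xDBFF:    # lead surrogate
            cpdex.append(i - len(cpdex))  # unit index -> char index
    return cpdex

def code_unit_index(char_index, cpdex):
    # Each earlier non-BMP character occupies one extra code unit.
    return char_index + bisect.bisect_left(cpdex, char_index)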
-- Terry Jan Reedy From nad at acm.org Wed Aug 24 22:15:27 2011 From: nad at acm.org (Ned Deily) Date: Wed, 24 Aug 2011 13:15:27 -0700 Subject: [Python-Dev] sendmsg/recvmsg on Mac OS X References: <20110824184927.2697b0af@pitrou.net> Message-ID: In article , Charles-Francois Natali wrote: > > But Snow Leopard, where these failures occur, is OS X 10.6. > > *sighs* > It still looks like a kernel/libc bug to me: AFAICT, both the code and > the tests are correct. > And apparently, there are still issues pertaining to FD passing on > 10.5 (and maybe later, I couldn't find a public access to their bug > tracker): > http://lists.apple.com/archives/Darwin-dev/2008/Feb/msg00033.html > > Anyway, if someone with a recent OS X release could run test_socket, > it would probably help. Follow ups to http://bugs.python.org/issue6560 I was able to do a quick test on 10.7 Lion and the 8 test failures still occur regardless of deployment target. Sorry, I don't have time to further investigate. -- Ned Deily, nad at acm.org From nad at acm.org Wed Aug 24 22:18:20 2011 From: nad at acm.org (Ned Deily) Date: Wed, 24 Aug 2011 13:18:20 -0700 Subject: [Python-Dev] sendmsg/recvmsg on Mac OS X References: <20110824184927.2697b0af@pitrou.net> <20110824205047.6be49525@pitrou.net> Message-ID: In article <20110824205047.6be49525 at pitrou.net>, Antoine Pitrou wrote: > On Wed, 24 Aug 2011 11:37:20 -0700 > Ned Deily wrote: > > In article <20110824184927.2697b0af at pitrou.net>, > > Antoine Pitrou wrote: > > > On Wed, 24 Aug 2011 15:31:50 +0200 > > > Charles-Fran?ois Natali wrote: > > > > > The buildbots are complaining about some of tests for the new > > > > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that > > > > > provide CMSG_LEN. > > > > > > > > Looks like kernel bugs: > > > > http://developer.apple.com/library/mac/#qa/qa1541/_index.html > > > > > > > > """ > > > > Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor > > > > passing > > > > [...] > > > > Avoid passing two or more descriptors back-to-back. > > > > """ > > > > > > But Snow Leopard, where these failures occur, is OS X 10.6. > > > > But chances are the build is using the default 10.4 ABI. Adding > > MACOSX_DEPLOYMENT_TARGET=10.6 as an env variable to ./configure may fix > > it. > > Does the ABI affect kernel bugs? If it's more of a "libc" sort of bug (i.e. somewhere below the app layer), it could. But, unfortunately, that doesn't seem to be the case here. -- Ned Deily, nad at acm.org From tjreedy at udel.edu Wed Aug 24 22:37:21 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 24 Aug 2011 16:37:21 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E5538B7.8010709@haypocalc.com> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <4E5538B7.8010709@haypocalc.com> Message-ID: On 8/24/2011 1:45 PM, Victor Stinner wrote: > Le 24/08/2011 02:46, Terry Reedy a ?crit : > I don't think that using UTF-16 with surrogate pairs is really a big > problem. A lot of work has been done to hide this. For example, > repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters. > Ezio fixed recently str.is*() methods in Python 3.2+. I greatly appreciate that he did. 
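For example, on a current 3.2.x narrow build (a quick check, not an exhaustive one):

>>> chr(0x10ffff)
'\U0010ffff'
>>> len(chr(0x10ffff))    # but it is still two code units underneath
2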
The * (lower,upper,title) methods apparently are not fixed yet as the corresponding new tests are currently skipped for narrow builds. > For len(str): its a known problem, but if you really care of the number > of *character* and not the number of UTF-16 units, it's easy to > implement your own character_length() function. len(str) gives the > UTF-16 units instead of the number of character for a simple reason: > it's faster: O(1), whereas character_length() is O(n). It is O(1) after a one-time O(n) preproccessing, which is the same time order for creating the string in the first place. Anyway, I think the most important deficiency is with iteration: >>> from unicodedata import name >>> name('\U0001043c') 'DESERET SMALL LETTER DEE' >>> for c in 'abc\U0001043c': print(name(c)) LATIN SMALL LETTER A LATIN SMALL LETTER B LATIN SMALL LETTER C Traceback (most recent call last): File "", line 2, in print(name(c)) ValueError: no such name This would work on wide builds but does not here (win7) because narrow build iteration produces a naked non-character surrogate code unit that has no specific entry in the Unicode Character Database. I believe that most new people who read "Strings contain Unicode characters." would expect string iteration to always produce the Unicode characters that they put in the string. The extra time per char needed to produce the surrogate pair that represents the character entered is O(1). >> utf16.py, attached to http://bugs.python.org/issue12729 >> prototypes a different solution than the PEP for the above problems for >> the 'mostly BMP' case. I will discuss it in a different post. > > Yeah, you can workaround UTF-16 limits using O(n) algorithms. I presented O(log(number of non-BMP chars)) algorithms for indexing and slicing. For the mostly BMP case, that is hugely better than O(n). > PEP-393 provides support of the full Unicode charset (U+0000-U+10FFFF) > an all platforms with a small memory footprint and only O(1) functions. For Windows users, I believe it will nearly double the memory footprint if there are any non-BMP chars. On my new machine, I should not mind that in exchange for correct behavior. -- Terry Jan Reedy From ethan at stoneleaf.us Wed Aug 24 23:26:54 2011 From: ethan at stoneleaf.us (Ethan Furman) Date: Wed, 24 Aug 2011 14:26:54 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <4E5538B7.8010709@haypocalc.com> Message-ID: <4E556C9E.7070605@stoneleaf.us> Terry Reedy wrote: >> PEP-393 provides support of the full Unicode charset (U+0000-U+10FFFF) >> an all platforms with a small memory footprint and only O(1) functions. > > For Windows users, I believe it will nearly double the memory footprint > if there are any non-BMP chars. On my new machine, I should not mind > that in exchange for correct behavior. > +1 Heck, I wouldn't mind it on my /old/ machine in exchange for correct behavior! 
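For what it is worth, the iteration problem Terry shows above can be papered over in pure Python by pairing the surrogates back up by hand; the following sketch behaves the same on wide builds, where the pairing branch is simply never taken:

def code_points(s):
    # Yield integer code points, joining UTF-16 surrogate pairs on
    # narrow builds.
    it = iter(s)
    for u in it:
        cp = ord(u)
        if 0xD800 <= cp <= 0xDBFF:   # lead surrogate: combine with trail
            cp = 0x10000 + ((cp - 0xD800) << 10) + (ord(next(it)) - 0xDC00)
        yield cp

>>> [hex(cp) for cp in code_points('abc\U0001043c')]
['0x61', '0x62', '0x63', '0x1043c']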
~Ethan~ From victor.stinner at haypocalc.com Wed Aug 24 23:10:32 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Wed, 24 Aug 2011 23:10:32 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E554883.5020908@g.nevcal.com> References: <4E554883.5020908@g.nevcal.com> Message-ID: <201108242310.32446.victor.stinner@haypocalc.com> Le mercredi 24 ao?t 2011 20:52:51, Glenn Linderman a ?crit : > Given the required variability of character size in all presently > Unicode defined encodings, I tend to agree with Tom that UTF-8, together > with some technique of translating character index to code unit offset, > may provide the best overall space utilization, and adequate CPU > efficiency. UTF-8 can use more space than latin1 or UCS2: >>> text="abc"; len(text.encode("latin1")), len(text.encode("utf8")) (3, 3) >>> text="???"; len(text.encode("latin1")), len(text.encode("utf8")) (3, 6) >>> text="???"; len(text.encode("utf-16-le")), len(text.encode("utf8")) (6, 9) >>> text="??"; len(text.encode("utf-16-le")), len(text.encode("utf8")) (4, 6) UTF-8 uses less space than PEP 393 only if you have few non-ASCII characters (or few non-BMP characters). About speed, I guess than O(n) (UTF8 indexing) is slower than O(1) (PEP 393 indexing). > ... Applications that support long > strings are more likely to bitten by the occasional "outlier" character > that is longer than the average character, doubling or quadrupling the > space needed to represent such strings, and eliminating a significant > portion of the space savings the PEP is providing for other > applications. In these worst cases, the PEP 393 is not worse than the current implementation: it just as much memory than Python in wide mode (mode used on Linux and Mac OS X because wchar_t is 32 bits). But it uses the double of Python in narrow mode (Windows). I agree than UTF-8 is better in these corner cases, but I also bet than most Python programs will use less memory and will be faster with the PEP 393. You can already try the pep-393 branch on your own programs. > Benchmarks may or may not fully reflect the actual > requirements of all applications, so conclusions based on benchmarking > can easily be blind-sided the realities of other applications, unless > the benchmarks are carefully constructed. I used stringbench and "./python -m test test_unicode". I plan to try iobench. Which other benchmark tool should be used? Should we write a new one? > It is possible that the ideas in PEP 393, with its support for multiple > underlying representations, could be the basis for some more complex > representations that would better support characters rather than only > supporting code points, ... I don't think that the *default* Unicode type is the best place for this. The base Unicode type has to be *very* efficient. If you have unusual needs, write your own type. Maybe based on the base type? 
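(The non-ASCII sample strings above were unfortunately mangled by the list archive; any characters of the same byte widths reproduce the figures, e.g. three U+00E9 and three U+20AC, chosen here purely for illustration:)

>>> text = "\xe9\xe9\xe9"          # three Latin-1 code points
>>> len(text.encode("latin1")), len(text.encode("utf8"))
(3, 6)
>>> text = "\u20ac\u20ac\u20ac"    # three BMP, non-Latin-1 code points
>>> len(text.encode("utf-16-le")), len(text.encode("utf8"))
(6, 9)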
Victor From martin at v.loewis.de Thu Aug 25 00:02:31 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Thu, 25 Aug 2011 00:02:31 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <4E5538B7.8010709@haypocalc.com> Message-ID: <4E5574F7.1060605@v.loewis.de> > For Windows users, I believe it will nearly double the memory footprint > if there are any non-BMP chars. On my new machine, I should not mind > that in exchange for correct behavior. In addition, strings with non-BMP chars are much more rare than strings with all Latin-1, for which memory usage halves on Windows. Regards, Martin From victor.stinner at haypocalc.com Thu Aug 25 00:29:19 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Thu, 25 Aug 2011 00:29:19 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <20110824203228.3e00874d@pitrou.net> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> Message-ID: <201108250029.19506.victor.stinner@haypocalc.com> > With this PEP, the unicode object overhead grows to 10 pointer-sized > words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. > Does it have any adverse effects? For pure ASCII, it might be possible to use a shorter struct: typedef struct { PyObject_HEAD Py_ssize_t length; Py_hash_t hash; int state; Py_ssize_t wstr_length; wchar_t *wstr; /* no more utf8_length, utf8, str */ /* followed by ascii data */ } _PyASCIIObject; (-2 pointer -1 ssize_t: 56 bytes) => "a" is 58 bytes (with utf8 for free, without wchar_t) For object allocated with the new API, we can use a shorter struct: typedef struct { PyObject_HEAD Py_ssize_t length; Py_hash_t hash; int state; Py_ssize_t wstr_length; wchar_t *wstr; Py_ssize_t utf8_length; char *utf8; /* no more str pointer */ /* followed by latin1/ucs2/ucs4 data */ } _PyNewUnicodeObject; (-1 pointer: 72 bytes) => "?" is 74 bytes (without utf8 / wchar_t) For the legacy API: typedef struct { PyObject_HEAD Py_ssize_t length; Py_hash_t hash; int state; Py_ssize_t wstr_length; wchar_t *wstr; Py_ssize_t utf8_length; char *utf8; void *str; } _PyLegacyUnicodeObject; (same size: 80 bytes) => "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t) The current struct: typedef struct { PyObject_HEAD Py_ssize_t length; Py_UNICODE *str; Py_hash_t hash; int state; PyObject *defenc; } PyUnicodeObject; => "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is wchar_t) ... but the code (maybe only the macros?) and debuging will be more complex. > Will the format codes returning a Py_UNICODE pointer with > PyArg_ParseTuple be deprecated? Because Python 2.x is still dominant and it's already hard enough to port C modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*). > Do you think the wstr representation could be removed in some future > version of Python? Conversion to wchar_t* is common, especially on Windows. But I don't know if we *have to* cache the result. Is it cached by the way? Or is wstr only used when a string is created from Py_UNICODE? 
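Anyone who wants to check the per-string cost on their own interpreter rather than from the struct definitions can use sys.getsizeof; the values naturally differ between narrow/wide and 32/64-bit builds, so none are hard-coded here:

import sys
for s in ("", "a", "a" * 1000):
    # fixed object overhead plus the per-character storage
    print(len(s), sys.getsizeof(s))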
Victor From v+python at g.nevcal.com Thu Aug 25 00:29:54 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 24 Aug 2011 15:29:54 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <4E554883.5020908@g.nevcal.com> Message-ID: <4E557B62.7030200@g.nevcal.com> On 8/24/2011 12:34 PM, Guido van Rossum wrote: > On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman wrote: >> On 8/24/2011 9:00 AM, Stefan Behnel wrote: >> >> Nick Coghlan, 24.08.2011 15:06: >> >> On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote: >> >> In utf16.py, attached to http://bugs.python.org/issue12729 >> I propose for consideration a prototype of different solution to the 'mostly >> BMP chars, few non-BMP chars' case. Rather than expand every character from >> 2 bytes to 4, attach an array cpdex of character (ie code point, not code >> unit) indexes. Then for indexing and slicing, the correction is simple, >> simpler than I first expected: >> code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) >> where code-unit-index is the adjusted index into the full underlying >> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids >> most of the space penalty and the consequent time penalty of moving more >> bytes around and increasing cache misses. >> >> Interesting idea, but putting on my C programmer hat, I say -1. >> >> Non-uniform cell size = not a C array = standard C array manipulation >> idioms don't work = pain (no matter how simple the index correction >> happens to be). >> >> The nice thing about PEP 383 is that it gives us the smallest storage >> array that is both an ordinary C array and has sufficiently large >> individual elements to handle every character in the string. >> >> +1 >> >> Yes, this sounds like a nice benefit, but the problem is it is false. The >> correct statement would be: >> >> The nice thing about PEP 383 is that it gives us the smallest storage >> array that is both an ordinary C array and has sufficiently large >> individual elements to handle every Unicode codepoint in the string. > (PEP 393, I presume. :-) This statement might yet be made true :) >> As Tom eloquently describes in the referenced issue (is Tom ever >> non-eloquent?), not all characters can be represented in a single codepoint. > But this is also besides the point (except insofar where we have to > remind ourselves not to confuse the two in docs). In the docs, yes, and in programmer's minds (influenced by docs). >> It seems there are three concepts in Unicode, code units, codepoints, and >> characters, none of which are equivalent (and the first of which varies >> according to the encoding). It also seems (to me) that Unicode has failed >> in its original premise, of being an easy way to handle "big char" for "all >> languages" with fixed size elements, but it is not clear that its original >> premise is achievable regardless of the size of "big char", when mixed >> directionality is desired, and it seems that support of some single >> languages require mixed directionality, not to mention mixed language >> support. > I see nothing wrong with having the language's fundamental data types > (i.e., the unicode object, and even the re module) to be defined in > terms of codepoints, not characters, and I see nothing wrong with > len() returning the number of codepoints (as long as it is advertised > as such). Me neither. 
> After all UTF-8 also defines an encoding for a sequence of > code points. Characters that require two or more codepoints are not > represented special in UTF-8 -- they are represented as two or more > encoded codepoints. The added requirement that UTF-8 must only be used > to represent valid characters is just that -- it doesn't affect how > strings are encoded, just what is considered valid at a higher level. Yes, this is true. In one sense, though, since UTF-8-supporting code already has to deal with variable length codepoint encoding, support for variable length character encoding seems like a minor extension, not upsetting any concept of fixed-width optimizations, because such cannot be used. >> Given the required variability of character size in all presently Unicode >> defined encodings, I tend to agree with Tom that UTF-8, together with some >> technique of translating character index to code unit offset, may provide >> the best overall space utilization, and adequate CPU efficiency. > There is no doubt that UTF-8 is the most space efficient. I just don't > think it is worth giving up O(1) indexing of codepoints -- it would > change programmers' expectations too much. Programmers that have to deal with bidi or composed characters shouldn't have such expectations, of course. But there are many programmers who do not, or at least who think they do not, and they can retain their O(1) expectations, I suppose, until it bites them. > OTOH I am sold on getting rid of the added complexities of "narrow > builds" where not even all codepoints can be represented without using > surrogate pairs (i.e. two code units per codepoint) and indexing uses > code units instead of codepoints. I think this is an area where PEP > 393 has a huge advantage: users can get rid of their exceptions for > narrow builds. Yep, the only justification for narrow builds is in interfacing to underlying broken OS that happen to use that encoding... it might be slightly more efficient when doing API calls to such an OS. But most interesting programs do much more than I/O. >> On the >> other hand, there are large subsets of applications that simply do not >> require support for bidirectional text or composed characters, and for those >> that do not, it remains to be seen if the price to be paid for supporting >> those features is too high a price for such applications. So far, we don't >> have implementations to benchmark to figure that out! > I think you are saying that many apps can ignore the distinction > between codepoints and characters. Given the complexity of bidi > rendering and normalization (which will always remain an issue) I > agree; this is much less likely to be a burden than the narrow-build > issues with code units vs. codepoints. > > What should the stdlib do? It should try to skirt the issue where it > can (using the garbage-in-garbage-out principle) and advertise what it > supports where there is a difference. I don't see why all the stdlib > should be made aware of multi-codepoint-characters and other bidi > requirements, but it should be clear to the user who has such > requirements which stdlib operations they can safely use. It would seem helpful if the stdlib could have some support for efficient handling of Unicode characters in some representation. It would help address the class of applications that does care. 
Adding extra support for Unicode character handling sooner rather than later could be an performance boost to applications that do care about full character support, and I can only see the numbers of such applications increasing over time. Such could be built as a subtype of str, perhaps, but if done in Python, there would likely be a significant performance hit when going from str to "unicodeCharacterStr". >> What does this mean for Python? Well, if Python is willing to limit its >> support for applications to the subset for which the "big char" solution >> sufficient, then PEP 393 provides a way to do that, that looks to be pretty >> effective for reducing memory consumption for those applications that use >> short strings most of which can be classified by content into the 1 byte or >> 2 byte representations. Applications that support long strings are more >> likely to bitten by the occasional "outlier" character that is longer than >> the average character, doubling or quadrupling the space needed to represent >> such strings, and eliminating a significant portion of the space savings the >> PEP is providing for other applications. > This seems more of an intuition than a fact. I could easily imagine > the facts being that even for large strings, usually either there are > no outliers, or there is a significant number of outliers. (E.g. Tom > Christiansen's OSCON preso falls in the latter category :-). > > As long as it *works* I don't really mind that there are some extreme > cases that are slow. You'll always have that. Yes, it is intuition, regarding memory consumption. It is not at all clear how different the "occasional outlier character" is than your "significant number of outliers". Tom's presentation certainly was regarding bodies of text which varied from ASCII to fully non-ASCII. The memory characteristics of long string handling would certainly be non-intuitive, when you can process a file of size N with a particular program, but can't process a smaller file because it has a funny character in it, and suddenly you are out of space. > >> Benchmarks may or may not fully >> reflect the actual requirements of all applications, so conclusions based on >> benchmarking can easily be blind-sided the realities of other applications, >> unless the benchmarks are carefully constructed. > Yeah, it's a learning process. > >> It is possible that the ideas in PEP 393, with its support for multiple >> underlying representations, could be the basis for some more complex >> representations that would better support characters rather than only >> supporting code points, but Martin has stated he is not open to additional >> representations, so the PEP itself cannot be that basis (although with care >> which may or may not be taken in the implementation of the PEP, the >> implementation may still provide that basis). > There is always the possibility of representations that are defined > purely by userland code and can only be manipulated by that specific > code. But expecting C extensions to support new representations that > haven't been defined yet sounds like a bad idea. While they can and should be prototyped in Python for functional correctness, I would rather expect such representations to be significantly slower in Python than in C. But that is just intuition also. 
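As a feel for what even the simplest character-level wrapper has to do, here is a very rough grouping of base code points with their trailing combining marks (nowhere near full UAX #29 grapheme clusters; rough_graphemes is just a name invented for this sketch):

import unicodedata

def rough_graphemes(s):
    # Group each base code point with any immediately following
    # combining marks; real grapheme cluster rules are more involved.
    cluster = ''
    for c in s:
        if cluster and unicodedata.combining(c) == 0:
            yield cluster
            cluster = c
        else:
            cluster += c
    if cluster:
        yield cluster

>>> [len(g) for g in rough_graphemes('Lo\u0302ve')]
[1, 2, 1, 1]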
The PEP makes a nice extension to str representations, but I'm not sure it picks the most useful ones, in that while it is picking cases that are well understood and are in use, they may not be the most effective ones (due to the strange memory consumption characteristics that outliers can introduce). My intuition says that a UTF-8 representation (or Tom's/Perl's looser utf8) would be a handy representation to have. But maybe it should be a different type than str... str8? I suppose that is -ideas land. -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothy.c.delaney at gmail.com Thu Aug 25 00:33:33 2011 From: timothy.c.delaney at gmail.com (Tim Delaney) Date: Thu, 25 Aug 2011 08:33:33 +1000 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <201108242310.32446.victor.stinner@haypocalc.com> References: <4E554883.5020908@g.nevcal.com> <201108242310.32446.victor.stinner@haypocalc.com> Message-ID: On 25 August 2011 07:10, Victor Stinner wrote: > > I used stringbench and "./python -m test test_unicode". I plan to try > iobench. > > Which other benchmark tool should be used? Should we write a new one? I think that the PyPy benchmarks (or at least selected tests such as slowspitfire) would probably exercise things quite well. http://speed.pypy.org/about/ Tim Delaney -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Thu Aug 25 01:28:53 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 24 Aug 2011 16:28:53 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E557B62.7030200@g.nevcal.com> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <4E554883.5020908@g.nevcal.com> <4E557B62.7030200@g.nevcal.com> Message-ID: On Wed, Aug 24, 2011 at 3:29 PM, Glenn Linderman wrote: > It would seem helpful if the stdlib could have some support for efficient > handling of Unicode characters in some representation.? It would help > address the class of applications that does care. I claim that we have insufficient understanding of their needs to put anything in the stdlib. Wait and see is a good strategy here. > Adding extra support for > Unicode character handling sooner rather than later could be an performance > boost to applications that do care about full character support, and I can > only see the numbers of such applications increasing over time.? Such could > be built as a subtype of str, perhaps, but if done in Python, there would > likely be a significant performance hit when going from str to > "unicodeCharacterStr". Sounds like overengineering to me. The right time to add something to the stdlib is when a large number of apps *currently* need something, not when you expect that they might need it in the future. (There just are too many possible futures to plan for them all. YAGNI rules.) -- --Guido van Rossum (python.org/~guido) From turnbull at sk.tsukuba.ac.jp Thu Aug 25 02:11:42 2011 From: turnbull at sk.tsukuba.ac.jp (Stephen J. 
Turnbull) Date: Thu, 25 Aug 2011 09:11:42 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <1314206189.3549.2.camel@localhost.localdomain> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <20110824183846.2b392f77@pitrou.net> <87mxeyzq63.fsf@uwakimon.sk.tsukuba.ac.jp> <1314206189.3549.2.camel@localhost.localdomain> Message-ID: <87mxeyxsch.fsf@uwakimon.sk.tsukuba.ac.jp> Antoine Pitrou writes: > Le jeudi 25 ao?t 2011 ? 02:15 +0900, Stephen J. Turnbull a ?crit : > > Antoine Pitrou writes: > > > On Thu, 25 Aug 2011 01:34:17 +0900 > > > "Stephen J. Turnbull" wrote: > > > > > > > > Martin has long claimed that the fact that I/O is done in terms of > > > > UTF-16 means that the internal representation is UTF-16 > > > > > > Which I/O? > > > > Eg, display of characters in the interpreter. > > I don't know why you say it's "done in terms of UTF-16", then. Unicode > strings are simply encoded to whatever character set is detected as the > terminal's character set. But it's not "simple" at the level we're talking about! Specifically, *in-memory* surrogates are properly respected when doing the encoding, and therefore such I/O is not UCS-2 or "raw code units". This treatment is different from sizing and indexing of unicodes, where surrogates are not treated differently from other code points. From turnbull at sk.tsukuba.ac.jp Thu Aug 25 02:31:30 2011 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Thu, 25 Aug 2011 09:31:30 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> Terry Reedy writes: > Please suggest a re-wording then, as it is a bug for doc and behavior to > disagree. Strings contain Unicode code units, which for most purposes can be treated as Unicode characters. However, even as "simple" an operation as "s1[0] == s2[0]" cannot be relied upon to give Unicode-conforming results. The second sentence remains true under PEP 393. > > > For the purpose of my sentence, the same thing in that code points > > > correspond to characters, > > > > Not in Unicode, they do not. By definition, a small number of code > > points (eg, U+FFFF) *never* did and *never* will correspond to > > characters. > > On computers, characters are represented by code points. What about the > other way around? http://www.unicode.org/glossary/#C says > code point: > 1) i in range(0x11000) > 2) "A value, or position, for a character" > (To muddy the waters more, 'character' has multiple definitions also.) > You are using 1), I am using 2) ;-(. No, you're not. You are claiming an isomorphism, which Unicode goes to great trouble to avoid. > I think you have it backwards. I see the current situation as the purity > of the C code beating the practicality for the user of getting right > answers. Sophistry. 
"Always getting the right answer" is purity. > > The thing is, that 90% of applications are not really going to care > > about full conformance to the Unicode standard. > > I remember when Intel argued that 99% of applications were not going to > be affected when the math coprocessor in its then new chips occasionally > gave 'non-standard' answers with certain divisors. In the case of Intel, the people who demanded standard answers did so for efficiency reasons -- they needed the FPU to DTRT because implementing FP in software was always going to be too slow. CPython, IMO, can afford to trade off because the implementation will necessarily be in software, and can be added later as a Python or C module. > I believe my scheme could be extended to solve [conformance for > composing characters] also. It would require more pre-processing > and more knowledge than I currently have of normalization. I have > the impression that the grapheme problem goes further than just > normalization. Yes and yes. But now you're talking about database lookups for every character (to determine if it's a composing character). Efficiency of a generic implementation isn't going to happen. Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's pronouncement, "indexing is going to be O(1)". And Nick's point about non-uniform arrays is telling. I have 20 years of experience with an implementation of text as a non-uniform array which presents an array API, and *everything* needs to be special-cased for efficiency, and *any* small change can have show-stopping performance implications. Python can probably do better than Emacs has done due to much better leadership in this area, but I still think it's better to make full conformance optional. From stephen at xemacs.org Thu Aug 25 02:36:14 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 25 Aug 2011 09:36:14 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <4E554883.5020908@g.nevcal.com> Message-ID: <87k4a2xr7l.fsf@uwakimon.sk.tsukuba.ac.jp> Guido van Rossum writes: > I see nothing wrong with having the language's fundamental data types > (i.e., the unicode object, and even the re module) to be defined in > terms of codepoints, not characters, and I see nothing wrong with > len() returning the number of codepoints (as long as it is advertised > as such). In fact, the Unicode Standard, Version 6, goes farther (to code units): 2.7 Unicode Strings A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, a Unicode 16-bit string is an ordered sequence of 16-bit code units, and a Unicode 32-bit string is an ordered sequence of 32-bit code units. Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. (p. 32). 
From guido at python.org Thu Aug 25 04:29:39 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 24 Aug 2011 19:29:39 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull wrote: > Terry Reedy writes: > > ?> Please suggest a re-wording then, as it is a bug for doc and behavior to > ?> disagree. > > ? ?Strings contain Unicode code units, which for most purposes can be > ? ?treated as Unicode characters. ?However, even as "simple" an > ? ?operation as "s1[0] == s2[0]" cannot be relied upon to give > ? ?Unicode-conforming results. > > The second sentence remains true under PEP 393. Really? If strings contain code units, that expression compares code units. What is non-conforming about comparing two code points? They are just integers. Seriously, what does Unicode-conforming mean here? It would be better to specify chapter and verse (e.g. is it a specific thing defined by the dreaded TR18?) > ?> > ? > ?For the purpose of my sentence, the same thing in that code points > ?> > ? > ?correspond to characters, > ?> > > ?> > Not in Unicode, they do not. ?By definition, a small number of code > ?> > points (eg, U+FFFF) *never* did and *never* will correspond to > ?> > characters. > ?> > ?> On computers, characters are represented by code points. What about the > ?> other way around? http://www.unicode.org/glossary/#C says > ?> code point: > ?> 1) i in range(0x11000) > ?> 2) "A value, or position, for a character" > ?> (To muddy the waters more, 'character' has multiple definitions also.) > ?> You are using 1), I am using 2) ;-(. > > No, you're not. ?You are claiming an isomorphism, which Unicode goes > to great trouble to avoid. I don't know that we will be able to educate our users to the point where they will use code unit, code point, character, glyph, character set, encoding, and other technical terms correctly. TBH even though less than two hours ago I composed a reply in this thread, I've already forgotten which is a code point and which is a code unit. > ?> I think you have it backwards. I see the current situation as the purity > ?> of the C code beating the practicality for the user of getting right > ?> answers. > > Sophistry. ?"Always getting the right answer" is purity. Eh? In most other areas Python is pretty careful not to promise to "always get the right answer" since what is right is entirely in the user's mind. We often go to great lengths of defining how things work so as to set the right expectations. For example, variables in Python work differently than in most other languages. Now I am happy to admit that for many Unicode issues the level at which we have currently defined things (code units, I think -- the thingies that encodings are made of) is confusing, and it would be better to switch to the others (code points, I think). But characters are right out. > ?> > The thing is, that 90% of applications are not really going to care > ?> > about full conformance to the Unicode standard. 
> ?> > ?> I remember when Intel argued that 99% of applications were not going to > ?> be affected when the math coprocessor in its then new chips occasionally > ?> gave 'non-standard' answers with certain divisors. > > In the case of Intel, the people who demanded standard answers did so > for efficiency reasons -- they needed the FPU to DTRT because > implementing FP in software was always going to be too slow. ?CPython, > IMO, can afford to trade off because the implementation will > necessarily be in software, and can be added later as a Python or C module. It is not so easy to change expectations about O(1) vs. O(N) behavior of indexing however. IMO we shouldn't try and hence we're stuck with operations defined in terms of code thingies instead of (mostly mythical) characters. > ?> I believe my scheme could be extended to solve [conformance for > ?> composing characters] also. It would require more pre-processing > ?> and more knowledge than I currently have of normalization. I have > ?> the impression that the grapheme problem goes further than just > ?> normalization. > > Yes and yes. ?But now you're talking about database lookups for every > character (to determine if it's a composing character). ?Efficiency of > a generic implementation isn't going to happen. Let's take small steps. Do the evolutionary thing. Let's get things right so users won't have to worry about code points vs. code units any more. A conforming library for all things at the character level can be developed later, once we understand things better at that level (again, most developers don't even understand most of the subtleties, so I claim we're not ready). > Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's > pronouncement, "indexing is going to be O(1)". I still think that. It would be too big of a cultural upheaval to change it. > ?And Nick's point about > non-uniform arrays is telling. ?I have 20 years of experience with an > implementation of text as a non-uniform array which presents an array > API, and *everything* needs to be special-cased for efficiency, and > *any* small change can have show-stopping performance implications. > > Python can probably do better than Emacs has done due to much better > leadership in this area, but I still think it's better to make full > conformance optional. This I agree with (though if you were referring to me with "leadership" I consider myself woefully underinformed about Unicode subtleties). I also suspect that Unicode "conformance" (however defined) is more part of a political battle than an actual necessity. I'd much rather have us fix Tom Christiansen's specific bugs than chase the elusive "standard conforming". (Hey, I feel a QOTW coming. "Standards? We don't need no stinkin' standards." http://en.wikipedia.org/wiki/Stinking_badges :-) -- --Guido van Rossum (python.org/~guido) From guido at python.org Thu Aug 25 04:33:51 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 24 Aug 2011 19:33:51 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87k4a2xr7l.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <4E554883.5020908@g.nevcal.com> <87k4a2xr7l.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Wed, Aug 24, 2011 at 5:36 PM, Stephen J. 
Turnbull wrote: > Guido van Rossum writes: > > ?> I see nothing wrong with having the language's fundamental data types > ?> (i.e., the unicode object, and even the re module) to be defined in > ?> terms of codepoints, not characters, and I see nothing wrong with > ?> len() returning the number of codepoints (as long as it is advertised > ?> as such). > > In fact, the Unicode Standard, Version 6, goes farther (to code units): > > ? ?2.7 ?Unicode Strings > > ? ?A Unicode string data type is simply an ordered sequence of code > ? ?units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit > ? ?code units, a Unicode 16-bit string is an ordered sequence of > ? ?16-bit code units, and a Unicode 32-bit string is an ordered > ? ?sequence of 32-bit code units. > > ? ?Depending on the programming environment, a Unicode string may or > ? ?may not be required to be in the corresponding Unicode encoding > ? ?form. For example, strings in Java, C#, or ECMAScript are Unicode > ? ?16-bit strings, but are not necessarily well-formed UTF-16 > ? ?sequences. > > (p. 32). I am assuming that that definition only applies to use of the term "unicode string" within the standard and has no bearing on how programming languages are allowed to use the term, as that would be preposterous. (They can define what they mean by terms like well-formed and conforming etc., and I won't try to go against that. But limiting what can be called a unicode string feels like unproductive coddling.) -- --Guido van Rossum (python.org/~guido) From ncoghlan at gmail.com Thu Aug 25 04:47:20 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 25 Aug 2011 12:47:20 +1000 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum wrote: > Now I am happy to admit that for many Unicode issues the level at > which we have currently defined things (code units, I think -- the > thingies that encodings are made of) is confusing, and it would be > better to switch to the others (code points, I think). But characters > are right out. Indeed, code points are the abstract concept and code units are the specific byte sequences that are used for serialisation (FWIW, I'm going to try to keep this straight in the future by remembering that the Unicode character set is defined as abstract points on planes, just like geometry). With narrow builds, code units can currently come into play internally, but with PEP 393 everything internal will be working directly with code points. Normalisation, combining characters and bidi issues may still affect the correctness of unicode comparison and slicing (and other text manipulation), but there are limits to how much of the underlying complexity we can effectively hide without being misleading. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? 
Brisbane, Australia From guido at python.org Thu Aug 25 05:11:20 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 24 Aug 2011 20:11:20 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Wed, Aug 24, 2011 at 7:47 PM, Nick Coghlan wrote: > On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum wrote: >> Now I am happy to admit that for many Unicode issues the level at >> which we have currently defined things (code units, I think -- the >> thingies that encodings are made of) is confusing, and it would be >> better to switch to the others (code points, I think). But characters >> are right out. > > Indeed, code points are the abstract concept and code units are the > specific byte sequences that are used for serialisation (FWIW, I'm > going to try to keep this straight in the future by remembering that > the Unicode character set is defined as abstract points on planes, > just like geometry). Hm, code points still look pretty concrete to me (integers in the range 0 .. 2**21) and code units don't feel like byte sequences to me (at least not UTF-16 code units -- in Python at least you can think of them as integers in the range 0 .. 2**16). > With narrow builds, code units can currently come into play > internally, but with PEP 393 everything internal will be working > directly with code points. Normalisation, combining characters and > bidi issues may still affect the correctness of unicode comparison and > slicing (and other text manipulation), but there are limits to how > much of the underlying complexity we can effectively hide without > being misleading. Let's just define a Unicode string to be a sequence of code points and let libraries deal with the rest. Ok, methods like lower() should consider characters, but indexing/slicing should refer to code points. Same for '=='; we can have a library that compares by applying (or assuming?) certain normalizations. Tom C tells me that case-less comparison cannot use a.lower() == b.lower(); fine, we can add that operation to the library too. But this exceeds the scope of PEP 393, right? -- --Guido van Rossum (python.org/~guido) From ncoghlan at gmail.com Thu Aug 25 05:48:49 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 25 Aug 2011 13:48:49 +1000 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Thu, Aug 25, 2011 at 1:11 PM, Guido van Rossum wrote: >> With narrow builds, code units can currently come into play >> internally, but with PEP 393 everything internal will be working >> directly with code points. 
Normalisation, combining characters and >> bidi issues may still affect the correctness of unicode comparison and >> slicing (and other text manipulation), but there are limits to how >> much of the underlying complexity we can effectively hide without >> being misleading. > > Let's just define a Unicode string to be a sequence of code points and > let libraries deal with the rest. Ok, methods like lower() should > consider characters, but indexing/slicing should refer to code points. > Same for '=='; we can have a library that compares by applying (or > assuming?) certain normalizations. Tom C tells me that case-less > comparison cannot use a.lower() == b.lower(); fine, we can add that > operation to the library too. But this exceeds the scope of PEP 393, > right? Yep, I was agreeing with you on this point - I think you're right that if we provide a solid code point based core Unicode type (perhaps with some character based methods), then library support can fill the gap between handling code points and handling characters. In particular, a unicode character based string type would be significantly easier to write in Python than it would be in C (after skimming Tom's bug report at http://bugs.python.org/issue12729, I better understand the motivation and desire for that kind of interface and it sounds like Terry's prototype is along those lines). Once those mappings are thrashed out outside the core, then there may be something to incorporate directly around the 3.4 timeframe (or potentially even in 3.3, since it should already be possible to develop such a wrapper based on UCS4 builds of 3.2) However, there may an important distinction to be made on the Python-the-language vs CPython-the-implementation front: is another implementation (e.g. PyPy) *allowed* to implement character based indexing instead of code point based for 2.x unicode/3.x str type? Or is the code point indexing part of the language spec, and any character based indexing needs to be provided via a separate type or module? Regards, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From stephen at xemacs.org Thu Aug 25 06:12:17 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 25 Aug 2011 13:12:17 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> Guido van Rossum writes: > On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull > wrote: > > ? ?Strings contain Unicode code units, which for most purposes can be > > ? ?treated as Unicode characters. ?However, even as "simple" an > > ? ?operation as "s1[0] == s2[0]" cannot be relied upon to give > > ? ?Unicode-conforming results. > > > > The second sentence remains true under PEP 393. > > Really? If strings contain code units, that expression compares code > units. That's true out of context, but in context it's "which for most purposes can be treated as Unicode characters", and this is what Terry is concerned with, as well. > What is non-conforming about comparing two code points? Unicode conformance means treating characters correctly. 
In particular, s1 and s2 might be NFC and NFD forms of the same string with a combining character at s2[1], or s1[1] and s[2] might be a non-combining character and a combining character respectively. > Seriously, what does Unicode-conforming mean here? Chapter 3, all verses. Here, specifically C6, p. 60. One would have to define the process executing "s1[0] == s2[0]" to be sure that even in the cases cited in the previous paragraph non-conformance is occurring, but one example of a process where that is non-conforming (without additional code to check for trailing combining characters) is in comparison of Vietnamese filenames generated on a Mac vs. those generated on a Linux host. > > No, you're not. ?You are claiming an isomorphism, which Unicode goes > > to great trouble to avoid. > > I don't know that we will be able to educate our users to the point > where they will use code unit, code point, character, glyph, character > set, encoding, and other technical terms correctly. Sure. I got it wrong myself earlier. I think that the right thing to do is to provide a conformant implementation of Unicode text in the stdlib (a long run goal, see below), and call that "Unicode", while we call strings "strings". > Now I am happy to admit that for many Unicode issues the level at > which we have currently defined things (code units, I think -- the > thingies that encodings are made of) is confusing, and it would be > better to switch to the others (code points, I think). Yes, and AFAICT (I'm better at reading standards than I am at reading Python implementation) PEP 393 allows that. > But characters are right out. +1 > It is not so easy to change expectations about O(1) vs. O(N) behavior > of indexing however. IMO we shouldn't try and hence we're stuck with > operations defined in terms of code thingies instead of (mostly > mythical) characters. Well, O(N) is not really the question. It's really O(log N), as Terry says. Is that out, too? I can verify that it's possible to do it in practice in the long term. In my experience with Emacs, even with 250 MB files, O(log N) mostly gives acceptable performance in an interactive editor, as well as many scripted textual applications. The problems that I see are (1) It's very easy to write algorithms that would be O(N) for a true array, but then become O(N log N) or worse (and the coefficient on the O(log N) algorithm is way higher to start). I guess this would kill the idea, but. (2) Maintenance is fragile; it's easy to break the necessary caches with feature additions and bug fixes. (However, I don't think this would be as big a problem for Python, due to its more disciplined process, as it has been for XEmacs.) You might think space for the caches would be a problem, but that has turned out not to be the case for Emacsen. > Let's take small steps. Do the evolutionary thing. Let's get things > right so users won't have to worry about code points vs. code units > any more. A conforming library for all things at the character level > can be developed later, once we understand things better at that level > (again, most developers don't even understand most of the subtleties, > so I claim we're not ready). I don't think anybody does. That's one reason there's a new version of Unicode every few years. > This I agree with (though if you were referring to me with > "leadership" I consider myself woefully underinformed about Unicode > subtleties). MvL and MAL are not, however, and there are plenty of others who make contributions -- in an orderly fashion. 
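The NFC/NFD point above is easy to reproduce with nothing but the stdlib, and it also shows the kind of extra step a conforming comparison has to take:

>>> import unicodedata
>>> s1 = unicodedata.normalize('NFC', 'e\u0301')    # composed form
>>> s2 = unicodedata.normalize('NFD', 'e\u0301')    # decomposed form
>>> s1 == s2, len(s1), len(s2), s1[0] == s2[0]
(False, 1, 2, False)
>>> unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2)
True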
> I also suspect that Unicode "conformance" (however defined) is more > part of a political battle than an actual necessity. I'd much > rather have us fix Tom Christiansen's specific bugs than chase the > elusive "standard conforming". Well, I would advocate specifying which parts of the standard we target and which not (for any given version). The goal of full "Chapter 3" conformance should be left up to a library on PyPI for the nonce IMO. I agree that fixing specific bugs should be given precedence over "conformance chasing," but implementation should conform to the appropriate parts of the standard. > (Hey, I feel a QOTW coming. "Standards? We don't need no stinkin' > standards." http://en.wikipedia.org/wiki/Stinking_badges :-) RMS beat you to that. Not good company to be in, in this case: he specifically disclaims the goal of portability to non-GNU-System systems. From stefan_ml at behnel.de Thu Aug 25 06:46:50 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 25 Aug 2011 06:46:50 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <201108250029.19506.victor.stinner@haypocalc.com> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <201108250029.19506.victor.stinner@haypocalc.com> Message-ID: Victor Stinner, 25.08.2011 00:29: >> With this PEP, the unicode object overhead grows to 10 pointer-sized >> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. >> Does it have any adverse effects? > > For pure ASCII, it might be possible to use a shorter struct: > > typedef struct { > PyObject_HEAD > Py_ssize_t length; > Py_hash_t hash; > int state; > Py_ssize_t wstr_length; > wchar_t *wstr; > /* no more utf8_length, utf8, str */ > /* followed by ascii data */ > } _PyASCIIObject; > (-2 pointer -1 ssize_t: 56 bytes) > > => "a" is 58 bytes (with utf8 for free, without wchar_t) > > For object allocated with the new API, we can use a shorter struct: > > typedef struct { > PyObject_HEAD > Py_ssize_t length; > Py_hash_t hash; > int state; > Py_ssize_t wstr_length; > wchar_t *wstr; > Py_ssize_t utf8_length; > char *utf8; > /* no more str pointer */ > /* followed by latin1/ucs2/ucs4 data */ > } _PyNewUnicodeObject; > (-1 pointer: 72 bytes) > > => "?" is 74 bytes (without utf8 / wchar_t) > > For the legacy API: > > typedef struct { > PyObject_HEAD > Py_ssize_t length; > Py_hash_t hash; > int state; > Py_ssize_t wstr_length; > wchar_t *wstr; > Py_ssize_t utf8_length; > char *utf8; > void *str; > } _PyLegacyUnicodeObject; > (same size: 80 bytes) > > => "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t) > > The current struct: > > typedef struct { > PyObject_HEAD > Py_ssize_t length; > Py_UNICODE *str; > Py_hash_t hash; > int state; > PyObject *defenc; > } PyUnicodeObject; > > => "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is > wchar_t) > > ... but the code (maybe only the macros?) and debuging will be more complex. That's an interesting idea. However, it's not required to do this as part of the PEP 393 implementation. This can be added later on if the need evidently arises in general practice. Also, there is always the possibility to simply intern very short strings in order to avoid their multiplication in memory. Long strings don't suffer from this as the data size quickly dominates. User code that works with a lot of short strings would likely do the same. BTW, I would expect that many short strings either go away as quickly as they appeared (e.g. 
in a parser) or were brought in as literals and are therefore interned anyway. That's just one reason why I suggest to wait for a prove of inefficiency in the real world (and, obviously, to test your own code with this as quickly as possible). >> Will the format codes returning a Py_UNICODE pointer with >> PyArg_ParseTuple be deprecated? > > Because Python 2.x is still dominant and it's already hard enough to port C > modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*). Well, it will be quite inefficient in future CPython versions, so I think if it's not officially deprecated at some point, it will deprecate itself for efficiency reasons. Better make it clear that it's worth investing in better performance here. >> Do you think the wstr representation could be removed in some future >> version of Python? > > Conversion to wchar_t* is common, especially on Windows. That's an issue. However, I cannot say how common this really is in practice. Surely depends on the specific code, right? How common is it in core CPython? > But I don't know if > we *have to* cache the result. Is it cached by the way? Or is wstr only used > when a string is created from Py_UNICODE? If it's so common on Windows, maybe it should only be cached there? Stefan From stefan_ml at behnel.de Thu Aug 25 07:09:28 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 25 Aug 2011 07:09:28 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E553FBC.7080501@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> Message-ID: "Martin v. L?wis", 24.08.2011 20:15: > Guido has agreed to eventually pronounce on PEP 393. Before that can > happen, I'd like to collect feedback on it. There have been a number > of voice supporting the PEP in principle Absolutely. > - conditions you would like to pose on the implementation before > acceptance. I'll see which of these can be resolved, and list > the ones that remain open. Just repeating here that I'd like to see the buffer void* changed into a union of pointers that state the exact layout type. IMHO, that would clarify the implementation and make it clearer that it's correct to access the data buffer as a flat array. (Obviously, code that does that is subject to future changes, that's why there are macros.) Stefan From v+python at g.nevcal.com Thu Aug 25 07:34:12 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 24 Aug 2011 22:34:12 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E55DED4.1000803@g.nevcal.com> On 8/24/2011 7:29 PM, Guido van Rossum wrote: > (Hey, I feel a QOTW coming. "Standards? We don't need no stinkin' > standards."http://en.wikipedia.org/wiki/Stinking_badges :-) Which deserves an appropriate, follow-on, misquote: Guido says the Unicode standard stinks. ??? <- and a Unicode smiley to go with it! -------------- next part -------------- An HTML attachment was scrubbed... URL: From stephen at xemacs.org Thu Aug 25 07:58:10 2011 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Thu, 25 Aug 2011 14:58:10 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87sjoq9gnh.fsf@uwakimon.sk.tsukuba.ac.jp> Nick Coghlan writes: > GvR writes: > > Let's just define a Unicode string to be a sequence of code points and > > let libraries deal with the rest. Ok, methods like lower() should > > consider characters, but indexing/slicing should refer to code points. > > Same for '=='; we can have a library that compares by applying (or > > assuming?) certain normalizations. Tom C tells me that case-less > > comparison cannot use a.lower() == b.lower(); fine, we can add that > > operation to the library too. But this exceeds the scope of PEP 393, > > right? > > Yep, I was agreeing with you on this point - I think you're right that > if we provide a solid code point based core Unicode type (perhaps with > some character based methods), then library support can fill the gap > between handling code points and handling characters. +1 I don't really see an alternative to this approach. The underlying array has to be exposed because there are too many applications that can take advantage of it, and analysis of decomposed characters requires it. Making that array be an array of code points is a really good idea, and Python already has that in the UCS-4 build. PEP 393 is "just" a space optimization that allows getting rid of the narrow build, with all its wartiness. > something to incorporate directly around the 3.4 timeframe (or > potentially even in 3.3, since it should already be possible to > develop such a wrapper based on UCS4 builds of 3.2) I agree that it's possible, but I estimate that it's not feasible for 3.3 because we don't yet know the requirements. This one really needs to ferment and mature in PyPI for a while because we just don't know how far the scope of user needs is going to extend. Bidi is a mudball[1], confusable character indexes sound like a cool idea for the web and email but is anybody really going to use them?, etc. > However, there may an important distinction to be made on the > Python-the-language vs CPython-the-implementation front: is another > implementation (e.g. PyPy) *allowed* to implement character based > indexing instead of code point based for 2.x unicode/3.x str type? Or > is the code point indexing part of the language spec, and any > character based indexing needs to be provided via a separate type or > module? +1 for language spec. Remember, there are cases in Unicode where you'd like to access base characters and the like. So you need to be able to get at individual code points in an NFD string. You shouldn't need to use different code for that in different implementations of Python. Footnotes: [1] Sure, we can implement the UAX#9 bidi algorithm, but it's not good enough by itself: something as simple as "File name (default {0}): ".format(name) can produce disconcerting results if the whole resulting string is treated by the UBA. 
Specifically, using the usual convention of uppercase letters being an RTL script, name = "ABCD" will result in the prompt: File name (default :(DCBA _ (where _ denotes the position of the insertion cursor). The Hebrew speakers on emacs-devel agreed that an example using a real Hebrew string didn't look right to them, either. From jxo6948 at rit.edu Thu Aug 25 09:12:56 2011 From: jxo6948 at rit.edu (John O'Connor) Date: Thu, 25 Aug 2011 00:12:56 -0700 Subject: [Python-Dev] FileSystemError or FilesystemError? In-Reply-To: References: <20110823202004.0bb63490@pitrou.net> Message-ID: +1 FileSystemError - For already stated reasons. - John -------------- next part -------------- An HTML attachment was scrubbed... URL: From greg.ewing at canterbury.ac.nz Thu Aug 25 05:34:27 2011 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Thu, 25 Aug 2011 15:34:27 +1200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E55C2C3.3060205@canterbury.ac.nz> On 25/08/11 14:29, Guido van Rossum wrote: > Let's get things > right so users won't have to worry about code points vs. code units > any more. What about things like the surrogateescape codec that deliberately use code units in non-standard ways? Will tricks like that still be possible if the code-unit level is hidden from the programmer? -- Greg From martin at v.loewis.de Thu Aug 25 09:36:06 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 25 Aug 2011 09:36:06 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E55FB66.8030802@v.loewis.de> >> Strings contain Unicode code units, which for most purposes can be >> treated as Unicode characters. However, even as "simple" an >> operation as "s1[0] == s2[0]" cannot be relied upon to give >> Unicode-conforming results. >> >> The second sentence remains true under PEP 393. > > Really? If strings contain code units, that expression compares code > units. What is non-conforming about comparing two code points? They > are just integers. > > Seriously, what does Unicode-conforming mean here? I think he's referring to combining characters and normal forms. 2.12 starts with "In cases involving two or more sequences considered to be equivalent, the Unicode Standard does not prescribe one particular sequence as being the correct one; instead, each sequence is merely equivalent to the others" That could be read to imply that the == operator should determine whether two strings are equivalent. However, the Unicode standard clearly leaves API design to the programming environment, and has the notion of conformance only for processes. 
So saying that Python is or is not unicode-conforming is, strictly speaking, meaningless. The closest conformance requirement in that respect is C6 "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct." However, that explicitly does *not* support the conformance statement that Stephen made. They elaborate "Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them." So practicality beats purity even in Unicode conformance: the == operator of Python can reasonably treat equivalent strings as unequal (and there is a good reason for that, indeed). Processes should not expect that other applications make the same distinction, so they need to cope if it matters to them. There are different way to do that: - normalize all strings on input, and then use == - use a different comparison operation that always normalizes its input first > This I agree with (though if you were referring to me with > "leadership" I consider myself woefully underinformed about Unicode > subtleties). I also suspect that Unicode "conformance" (however > defined) is more part of a political battle than an actual necessity. Fortunately, it's much better than that. Unicode had very clear conformance requirements for a long time, and they aren't hard to meet. Wrt. C6, Python could certainly improve, e.g. by caching whether a string had been determined to be in normal form, so that applications can more reasonably apply normalization to all strings they ever want to compare. Regards, Martin From martin at v.loewis.de Thu Aug 25 09:45:48 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 25 Aug 2011 09:45:48 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E55FDAC.9010605@v.loewis.de> > > What is non-conforming about comparing two code points? > > Unicode conformance means treating characters correctly. Re-read the text. You are interpreting something that isn't there. > > Seriously, what does Unicode-conforming mean here? > > Chapter 3, all verses. Here, specifically C6, p. 60. One would have > to define the process executing "s1[0] == s2[0]" to be sure that even > in the cases cited in the previous paragraph non-conformance is > occurring No, that's explicitly *not* what C6 says. Instead, it says that a process that treats s1 and s2 differently shall not assume that others will do the same, i.e. that it is ok to treat them the same even though they have different code points. Treating them differently is also conforming. 
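A small sketch of a comparison operation that normalizes its input first (stdlib unicodedata only; the helper name is made up for illustration):

    import unicodedata

    def equal_normalized(a, b):
        # Treat canonically equivalent strings as equal by normalizing
        # both operands to NFC before comparing code point sequences.
        return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

    s1 = '\u00c5'              # precomposed LATIN CAPITAL LETTER A WITH RING ABOVE
    s2 = 'A\u030a'             # 'A' followed by COMBINING RING ABOVE
    s1 == s2                   # False: different code point sequences
    equal_normalized(s1, s2)   # True: canonically equivalent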
Regards, Martin From martin at v.loewis.de Thu Aug 25 09:50:08 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 25 Aug 2011 09:50:08 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E55C2C3.3060205@canterbury.ac.nz> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55C2C3.3060205@canterbury.ac.nz> Message-ID: <4E55FEB0.6070706@v.loewis.de> > What about things like the surrogateescape codec that > deliberately use code units in non-standard ways? Will > tricks like that still be possible if the code-unit > level is hidden from the programmer? Most certainly. In the PEP-393 representation, the surrogate characters can readily be represented (and would imply atleast the two-byte form), but they will never take their UTF-16 function (i.e. the UTF-8 codec won't try to combine surrogate pairs), so they can be used for surrogateescape and other functions. Of course, in strict error mode, codecs will refuse to encode them (notice that surrogateescape is an error handler, not a codec). Regards, Martin From martin at v.loewis.de Thu Aug 25 10:24:39 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Thu, 25 Aug 2011 10:24:39 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <20110824203228.3e00874d@pitrou.net> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> Message-ID: <4E5606C7.9000404@v.loewis.de> > With this PEP, the unicode object overhead grows to 10 pointer-sized > words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. > Does it have any adverse effects? If I count correctly, it's only three *additional* words (compared to 3.2): four new ones, minus one that is removed. In addition, it drops a memory block. Assuming a malloc overhead of two pointers per malloc block, we get one additional pointer. On a 32-bit machine with a 32-bit wchar_t, pure-ASCII strings of length 1 (+NUL) will take the same memory either way: 8 bytes for the characters in 3.2, 2 bytes in 3.3 + extra pointer + padding. Strings of 2 or more characters will take more space in 3.2. On a 32-bit machine with a 16-bit wchar_t, pure-ASCII strings up to 3 characters take the same space either way; space savings start at four characters. On a 64-bit machine with a 16-bit wchar_t, assuming a malloc minimum block size of 16 bytes, pure-ASCII strings of up to 7 characters take the same space. For 8 characters, 3.2 will need 32 bytes for the characters, whereas 3.3 will only take 16 bytes (due to padding). So: no, I can't see any adverse effects. Details depend on the malloc implementation, though. A slight memory increase may occur on compared to a narrow build may occur for strings that use non-Latin-1, and a large increase for strings that use non-BMP characters. The real issue of memory consumption is the alternative representations, if created. That applies for the default encoding in 3.2 as well as the wchar_t and UTF-8 representations in 3.3. > Are there any plans to make instantiation of small strings fast enough? > Or is it already as fast as it should be? I don't have any plans, and I don't see potential. 
Compared to 3.2, it saves a malloc call, which may be quite an improvement. OTOH, it needs to iterate over the characters twice, to find the largest character. If you are referring to the reuse of Unicode objects: that's currently not done, and is difficult to do in the 3.2 way due to the various sizes of characters. One idea might be to only reuse UCS1 strings, and then keep a freelist for these based on the string length. > When interfacing with the Win32 "wide" APIs, what is the recommended > way to get the required LPCWSTR? As before: PyUnicode_AsUnicode. > Will the format codes returning a Py_UNICODE pointer with > PyArg_ParseTuple be deprecated? Not for 3.3, no. > Do you think the wstr representation could be removed in some future > version of Python? Yes. This probably has to wait for Python 4, though. > Is PyUnicode_Ready() necessary for all unicode objects, or only those > allocated through the legacy API? Only for the latter (although it doesn't hurt to apply it to all of them). > ?The Py_Unicode representation is not instantaneously available?: you > mean the Py_UNICODE representation? Thanks, fixed. >> - conditions you would like to pose on the implementation before >> acceptance. I'll see which of these can be resolved, and list >> the ones that remain open. > > That it doesn't significantly slow down benchmarks such as stringbench > and iobench. Can you please quantify "significantly"? Also, having a complete list of benchmarks to perform prior to acceptance would be helpful. Thanks, Martin From victor.stinner at haypocalc.com Thu Aug 25 11:10:50 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Thu, 25 Aug 2011 11:10:50 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E56119A.8010900@haypocalc.com> Le 25/08/2011 06:12, Stephen J. Turnbull a ?crit : > > Let's take small steps. Do the evolutionary thing. Let's get things > > right so users won't have to worry about code points vs. code units > > any more. A conforming library for all things at the character level > > can be developed later, once we understand things better at that level > > (again, most developers don't even understand most of the subtleties, > > so I claim we're not ready). > > I don't think anybody does. That's one reason there's a new version > of Unicode every few years. It took some weeks (months?) to write the PEP, and months to implement it. This PEP is only a minor change of the implementation of Unicode in Python. A larger change will take much more time (and maybe change/break the C and/or Python API a little bit more). If you are able to implement your specfication (a Unicode type with a "real" character API), please write a PEP and implement it. You may begin with a prototype in Python, and then rewrite it in C. But I don't think that any core developer will do that for you. It's not how free software works. 
At least, I don't think that anyone will do that for free :-) (I bet that many developers will accept to implement that for money :-)) Victor From victor.stinner at haypocalc.com Thu Aug 25 11:17:06 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Thu, 25 Aug 2011 11:17:06 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <201108250029.19506.victor.stinner@haypocalc.com> Message-ID: <4E561312.3030404@haypocalc.com> Le 25/08/2011 06:46, Stefan Behnel a ?crit : >> Conversion to wchar_t* is common, especially on Windows. > > That's an issue. However, I cannot say how common this really is in > practice. Surely depends on the specific code, right? How common is it > in core CPython? Quite all functions taking text as argument on Windows expects wchar_t* strings (UTF-16). In Python, we pass a "Py_UNICODE*" (PyUnicode_AS_UNICODE or PyUnicode_AsUnicode) because Py_UNICODE is wchar_t on Windows. Victor From stephen at xemacs.org Thu Aug 25 11:39:46 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 25 Aug 2011 18:39:46 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E55FDAC.9010605@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55FDAC.9010605@v.loewis.de> Message-ID: <87liuhakyl.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. L?wis" writes: > No, that's explicitly *not* what C6 says. Instead, it says that a > process that treats s1 and s2 differently shall not assume that others > will do the same, i.e. that it is ok to treat them the same even though > they have different code points. Treating them differently is also > conforming. Then what requirement does C6 impose, in your opinion? It sounds like you don't think it imposes any, in practice. Note that in the discussion of C6, the standard says, - Ideally, an implementation would *always* interpret two canonical-equivalent sequences *identically*. There are practical circumstances under which implementations may reasonably distinguish them. (Emphasis mine.) The examples given are things like "inspecting memory representation structure" (which properly speaking is really outside of Unicode conformance) and "ignoring collation behavior of combining sequences outside the repertoire of a specified language." That sounds like "Special cases aren't special enough to break the rules. Although practicality beats purity." to me. Treating things differently is an exceptional case, that requires sufficient justification. My understanding is that if those strings are exchanged with an another process, then whether or not treating them differently is allowed depends on whether the results will be output to another process, and what the definition of our process is. Sometimes it will be allowed, but mostly it won't. Take file names as an example. 
If our process is working with an external process (the OS's file system driver) whose definition includes the statement that "File names are sequences of Unicode characters", then C6 says our process must compare canonically equivalent sequences that it takes to be file names as the same, whether or not they are in the same normalized form, or normalized at all, because we can't assume the file system will treat them as different. If we do treat them as different, our users will get very upset (eg, if we don't signal a duplicate file name input by the user, and then the OS proceeds to overwrite an existing file). Dually, having made the statement that file names are Unicode, C6 says that the OS driver must return the same file given two canonically equivalent strings that happen to have different code points in them, because it may not assume that *we* will treat those strings as different names of different files. *Users* will certainly take the viewpoint that two strings that display the same on their monitor should identify the same file when they use them as file names. Now, I'm *not* saying that Python's strings *should* conform to the Unicode standard in this respect yet (or ever, for that matter; I'm with Guido on that). I'm simply saying that the current implementation of strings, as improved by PEP 393, can not be said to be conforming. I would like to see something much more conformant done as a separate library (the Python Components for Unicode, say), intended to support users who need character-based behavior, Unicode-ly correct collation, etc., more than efficiency. Applications that need both will have to make their own way at first, either by contributing improvements to the library or by using application-specific algorithms. From martin at v.loewis.de Thu Aug 25 11:57:53 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 25 Aug 2011 11:57:53 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87liuhakyl.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55FDAC.9010605@v.loewis.de> <87liuhakyl.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E561CA1.8020500@v.loewis.de> Am 25.08.2011 11:39, schrieb Stephen J. Turnbull: > "Martin v. L?wis" writes: > > > No, that's explicitly *not* what C6 says. Instead, it says that a > > process that treats s1 and s2 differently shall not assume that others > > will do the same, i.e. that it is ok to treat them the same even though > > they have different code points. Treating them differently is also > > conforming. > > Then what requirement does C6 impose, in your opinion? In IETF terminology, it's a weak SHOULD requirement. Unless there are reasons not to, equivalent strings should be treated differently. It's a weak requirement because the reasons not to treat them equivalent are wide-spread. > - Ideally, an implementation would *always* interpret two > canonical-equivalent sequences *identically*. There are practical > circumstances under which implementations may reasonably distinguish > them. (Emphasis mine.) 
Ok, so let me put emphasis on *ideally*. They acknowledge that for practical reasons, the equivalent strings may need to be distinguished. > The examples given are things like "inspecting memory representation > structure" (which properly speaking is really outside of Unicode > conformance) and "ignoring collation behavior of combining sequences > outside the repertoire of a specified language." That sounds like > "Special cases aren't special enough to break the rules. Although > practicality beats purity." to me. Treating things differently is an > exceptional case, that requires sufficient justification. And the common justification is efficiency, along with the desire to support the representation of unnormalized strings (else there would be an efficient implementation). > If our process is working with an external process (the OS's file > system driver) whose definition includes the statement that "File > names are sequences of Unicode characters", then C6 says our process > must compare canonically equivalent sequences that it takes to be file > names as the same, whether or not they are in the same normalized > form, or normalized at all, because we can't assume the file system > will treat them as different. It may well happen that this requirement is met in a plain Python application. If the file system and GUI libraries always return NFD strings, then the Python process *will* compare equivalent sequences correctly (since it won't ever get any other representations). > *Users* will certainly take the viewpoint that two strings that > display the same on their monitor should identify the same file when > they use them as file names. Yes, but that's the operating system's choice first of all. Some operating systems do allow file names in a single directory that are equivalent yet use different code points. Python then needs to support this operating system, despite the permission of the Unicode standard to ignore the difference. > I'm simply saying that the current > implementation of strings, as improved by PEP 393, can not be said to > be conforming. I continue to disagree. The Unicode standard deliberately allows Python's behavior as conforming. > I would like to see something much more conformant done as a separate > library (the Python Components for Unicode, say), intended to support > users who need character-based behavior, Unicode-ly correct collation, > etc., more than efficiency. Wrt. normalization, I think all that's needed is already there. Applications just need to normalize all strings to a normal form of their liking, and be done. That's easier than using a separate library throughout the code base (let alone using yet another string type). Regards, Martin From solipsis at pitrou.net Thu Aug 25 13:27:34 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 25 Aug 2011 13:27:34 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5606C7.9000404@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> Message-ID: <20110825132734.1c236d17@pitrou.net> Hello, On Thu, 25 Aug 2011 10:24:39 +0200 "Martin v. L?wis" wrote: > > On a 32-bit machine with a 32-bit wchar_t, pure-ASCII strings of length > 1 (+NUL) will take the same memory either way: 8 bytes for the > characters in 3.2, 2 bytes in 3.3 + extra pointer + padding. Strings > of 2 or more characters will take more space in 3.2. 
> > On a 32-bit machine with a 16-bit wchar_t, pure-ASCII strings up > to 3 characters take the same space either way; space savings start at > four characters. > > On a 64-bit machine with a 16-bit wchar_t, assuming a malloc minimum > block size of 16 bytes, pure-ASCII strings of up to 7 characters take > the same space. For 8 characters, 3.2 will need 32 bytes for the > characters, whereas 3.3 will only take 16 bytes (due to padding). That's very good. For future reference, could you add this information to the PEP? > >> - conditions you would like to pose on the implementation before > >> acceptance. I'll see which of these can be resolved, and list > >> the ones that remain open. > > > > That it doesn't significantly slow down benchmarks such as stringbench > > and iobench. > > Can you please quantify "significantly"? Also, having a complete list > of benchmarks to perform prior to acceptance would be helpful. I would say no more than a 15% slowdown on each of the following benchmarks: - stringbench.py -u (http://svn.python.org/view/sandbox/trunk/stringbench/) - iobench.py -t (in Tools/iobench/) - the json_dump, json_load and regex_v8 tests from http://hg.python.org/benchmarks/ I believe these are representative of string-heavy operations. Additionally, it would be nice if you could run at least some of the test_bigmem tests, according to your system's available RAM. Regards Antoine. From ncoghlan at gmail.com Thu Aug 25 13:54:36 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 25 Aug 2011 21:54:36 +1000 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E561CA1.8020500@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55FDAC.9010605@v.loewis.de> <87liuhakyl.fsf@uwakimon.sk.tsukuba.ac.jp> <4E561CA1.8020500@v.loewis.de> Message-ID: On Thu, Aug 25, 2011 at 7:57 PM, "Martin v. L?wis" wrote: > Am 25.08.2011 11:39, schrieb Stephen J. Turnbull: >> I'm simply saying that the current >> implementation of strings, as improved by PEP 393, can not be said to >> be conforming. > > I continue to disagree. The Unicode standard deliberately allows > Python's behavior as conforming. I'd actually put it slightly differently: it seems to me that Python, in and of itself, can neither conform to nor violate that part of the standard, since conformance depends on how the *application* processes the data. However, we can make it harder or easier for applications to be conformant. UCS2 builds make it harder, since some code points have to be represented as code units internally. UCS4 builds and future PEP 393 builds (which should exhibit current UCS4 build semantics at the Python layer) make it easier, since the internal representation consistently uses code points, with code units only appearing as part of the encoding and decoding process. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From stephen at xemacs.org Thu Aug 25 13:58:30 2011 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Thu, 25 Aug 2011 20:58:30 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E561CA1.8020500@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55FDAC.9010605@v.loewis.de> <87liuhakyl.fsf@uwakimon.sk.tsukuba.ac.jp> <4E561CA1.8020500@v.loewis.de> Message-ID: <87ipplaejd.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. L?wis" writes: > Am 25.08.2011 11:39, schrieb Stephen J. Turnbull: > > "Martin v. L?wis" writes: > > > > > No, that's explicitly *not* what C6 says. Instead, it says that a > > > process that treats s1 and s2 differently shall not assume that others > > > will do the same, i.e. that it is ok to treat them the same even though > > > they have different code points. Treating them differently is also > > > conforming. > > > > Then what requirement does C6 impose, in your opinion? > > In IETF terminology, it's a weak SHOULD requirement. Unless there are > reasons not to, equivalent strings should be treated differently. It's > a weak requirement because the reasons not to treat them equivalent are > wide-spread. There are no "weak SHOULDs" and no "wide-spread reasons" in RFC 2119. RFC 2119 specifies "particular circumstances" and "full implications" that are "carefully weighed" before varying from SHOULD behavior. IMHO the Unicode Standard intends a full RFC 2119 "SHOULD" here. > Yes, but that's the operating system's choice first of all. Some > operating systems do allow file names in a single directory that > are equivalent yet use different code points. Python then needs to > support this operating system, despite the permission of the > Unicode standard to ignore the difference. Sure, and that's one of several such reasons why I think the PEP's implementation of unicodes as arrays of code points is an optimal balance. But the Unicode standard does not "permit" ignoring the difference here, except in the sense that *the Unicode standard doesn't apply at all* and therefore doesn't forbid it. The OSes in question are not conforming processes, and presumably don't claim to be. Because most of the processes Python interacts with won't be conforming processes (not even the majority of textual applications, for a while), Python does not need to be, and *should not* be, a conforming Unicode process for most of what it does. Not even for much of its text processing. Also, to the extent that Python is a general-purpose language, I see nothing wrong and lots of good in having a non-conformant code point array type as the platform for implementing conforming Unicode library(ies). But this is not user/developer-friendly at all: > Wrt. normalization, I think all that's needed is already there. > Applications just need to normalize all strings to a normal form of > their liking, and be done. That's easier than using a separate library > throughout the code base (let alone using yet another string type). But many users have never heard of normalization. And that's *just* normalization. There is a whole raft of other requirements for conformance (collation, case, etc). 
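Case alone already shows the problem; a small sketch (the classic German sharp s) of why a.lower() == b.lower() is not a correct caseless comparison:

    # Full Unicode case folding maps U+00DF (sharp s) to 'ss', but lower() does not,
    # so two spellings of the same word fail a lower()-based caseless match:
    'STRASSE'.lower()                             # 'strasse'
    'stra\u00dfe'.lower()                         # 'stra\u00dfe' (unchanged)
    'STRASSE'.lower() == 'stra\u00dfe'.lower()    # False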
The point is that with such a library and string type, various aspects of conformance to Unicode, as well as conformance to associated standards (eg, the dreaded UTS #18 ;-) can be added to the library over time, and most users (those who don't need to squeeze every ounce of performance out of Python) can be blissfully unaware of what, if anything, they're conforming to. Just upgrade the library to get the best Unicode support (in terms of conformance) that Python has to offer. But for the reasons you (and Guido and Nick and ...) give, it's not reasonable to put all that into core Python, not anytime soon. Not to mention that as a work-in-progress, it can hardly be considered stable enough for the stdlib. That is what Terry Reedy is getting at, AIUI. "Batteries included" should mean as much Unicode conformance as we can reasonably provide should be *conveniently* available. The ideal (given the caveat about efficiency) would be *one* import statement and a ConformingUnicode type that acts "just like a string" in all ways, except that (1) it indexes and counts on characters (preferably "grapheme clusters" :-), (2) does collation, regexps, and the like conformant to the Unicode standard, and (3) may be quite inefficient from the point of view of bit- shoveling net applications and the like. Of course most of (2) is going to take quite a while, but (1) and (3) should not be that hard to accomplish (especially (3) ;-). > > I'm simply saying that the current implementation of strings, as > > improved by PEP 393, can not be said to be conforming. > > I continue to disagree. The Unicode standard deliberately allows > Python's behavior as conforming. That's up to you. I doubt very many users or application developers will see it that way, though. I think they would prefer that we be conservative about what we call "conformant", and tell them precisely what they need to do to get what they consider conformant behavior from Python. That's easier if we share definitions of conformant with them. And surely there would be great joy on the battlements if there were a one-import way to spell "all the Unicode conformance you can give me, please". The problem with your legalistic approach, as I see it, is that if our definition is looser than the users', all their surprises will be unpleasant. That's not good. From ncoghlan at gmail.com Thu Aug 25 13:59:31 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 25 Aug 2011 21:59:31 +1000 Subject: [Python-Dev] [Python-checkins] devguide: #12792: document the "type" field of the tracker. In-Reply-To: References: Message-ID: On Tue, Aug 23, 2011 at 7:46 AM, ezio.melotti wrote: > +security > + ? ?Issues that might have security implications. ?If you think the issue > + ? ?should not be made public, please report it to security at python.org instead. A link to http://www.python.org/news/security/ would be handy here, since that has the GPG key to send encrypted messages to the security list. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From ncoghlan at gmail.com Thu Aug 25 14:01:00 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 25 Aug 2011 22:01:00 +1000 Subject: [Python-Dev] [Python-checkins] devguide: #12792: document the "type" field of the tracker. In-Reply-To: References: Message-ID: On Thu, Aug 25, 2011 at 9:59 PM, Nick Coghlan wrote: > A link to http://www.python.org/news/security/ would be handy here, > since that has the GPG key to send encrypted messages to the security > list. 
http://www.python.org/security/ is a better variant of the link, though (it redirects to the security advisory page, but looks nicer) Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From facundobatista at gmail.com Thu Aug 25 14:42:10 2011 From: facundobatista at gmail.com (Facundo Batista) Date: Thu, 25 Aug 2011 09:42:10 -0300 Subject: [Python-Dev] DNS problem with ar.pycon.org Message-ID: Sorry for the crossposting, but I don't know who admins the pycon.org site. it seems that something happened to "ar.pycon.org", it should point to the same IP than "pycon.python.org.ar" (190.228.30.157). Somebody knows who can fix it? BTW, how do I update that page? We're having the third PyCon in Argentina this year... Thank you! -- .? ? Facundo Blog: http://www.taniquetil.com.ar/plog/ PyAr: http://www.python.org/ar/ From torsten.becker at gmail.com Thu Aug 25 20:12:04 2011 From: torsten.becker at gmail.com (Torsten Becker) Date: Thu, 25 Aug 2011 14:12:04 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <4E53A87A.1070306@v.loewis.de> <20110823160820.08754ffe@pitrou.net> Message-ID: Okay, I am convinced. :) If Martin does not object, I would change the "void *str" field to union { void *any; unsigned char *latin1; Py_UCS2 *ucs2; Py_UCS4 *ucs4; } data; Regards, Torsten On Wed, Aug 24, 2011 at 02:57, Stefan Behnel wrote: > Torsten Becker, 24.08.2011 04:41: >> >> Also, common, now simple, checks for "unicode->str == NULL" would look >> more ambiguous with a union ("unicode->str.latin1 == NULL"). > > You could just add yet another field "any", i.e. > > ? ?union { > ? ? ? unsigned char* latin1; > ? ? ? Py_UCS2* ucs2; > ? ? ? Py_UCS4* ucs4; > ? ? ? void* any; > ? ?} str; > > That way, the above test becomes > > ? ?if (!unicode->str.any) > > or > > ? ?if (unicode->str.any == NULL) > > Or maybe even call it "initialised" to match the intended purpose: > > ? ?if (!unicode->str.initialised) > > That being said, I don't mind "unicode->str.latin1 == NULL" either, given > that it will (as mentioned by others) be hidden behind a macro most of the > time anyway. > > Stefan From stefan_ml at behnel.de Thu Aug 25 20:47:25 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 25 Aug 2011 20:47:25 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E553FBC.7080501@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> Message-ID: "Martin v. L?wis", 24.08.2011 20:15: > - issues to be considered (unclarities, bugs, limitations, ...) A problem of the current implementation is the need for calling PyUnicode_(FAST_)READY(), and the fact that it can fail (e.g. due to insufficient memory). Basically, this means that even something as trivial as trying to get the length of a Unicode string can now result in an error. I just noticed this when rewriting Cython's helper function that searches a unicode string for a (Py_UCS4) character. Previously, the entire function was safe, could never produce an error and therefore always returned a boolean result. In the new world, the caller of this function must check and propagate errors. This may not be a major issue in most cases, but it can have a non-trivial impact on user code, depending on how deep in a call chain this happens and on how much control the user has over the call chain (think of a C callback, for example). 
Also, even in the case that there is no error, the potential need to build up the string on request means that the run time and memory requirements of an algorithm are less predictable now as they depend on the origin of the input and not just its Python level string content. I would be happier with an implementation that avoided this by always instantiating the data buffer right from the start, instead of carrying only a Py_UNICODE buffer for old-style instances. Stefan From lukasz at langa.pl Thu Aug 25 22:37:30 2011 From: lukasz at langa.pl (=?iso-8859-2?Q?=A3ukasz_Langa?=) Date: Thu, 25 Aug 2011 22:37:30 +0200 Subject: [Python-Dev] Sphinx version for Python 2.x docs In-Reply-To: References: <4E4AF610.5040303@simplistix.co.uk> Message-ID: Wiadomo?? napisana przez Sandro Tosi w dniu 23 sie 2011, o godz. 01:09: > What I want to understand if it's an acceptable change. > > I see sphinx more as of an internal, building tool, so freezing it > it's like saying "don't upgrade gcc" or so. Normally I'd say it's natural for us to specify that for a legacy release we're using build tools in versions up to so-and-so. Plus, requiring changes in the repository additionally points that this is indeed touching "frozen" code. In case of 2.7 though, it's our "LTS release" so I think if Georg agrees, I'm also in favor of the upgrade. As for Sphinx using svn.python.org, the main issue is not altering the scripts to use Hg, it's the weight of the whole Sphinx repository that would have to be cloned for each distclean. By using SVN you're only downloading a specifically tagged source tree. -- Best regards, ?ukasz Langa Senior Systems Architecture Engineer IT Infrastructure Department Grupa Allegro Sp. z o.o. Pomy?l o ?rodowisku naturalnym zanim wydrukujesz t? wiadomo??! Please consider the environment before printing out this e-mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 1898 bytes Desc: not available URL: From stefan_ml at behnel.de Thu Aug 25 23:30:13 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 25 Aug 2011 23:30:13 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> Message-ID: Stefan Behnel, 25.08.2011 20:47: > "Martin v. L?wis", 24.08.2011 20:15: >> - issues to be considered (unclarities, bugs, limitations, ...) > > A problem of the current implementation is the need for calling > PyUnicode_(FAST_)READY(), and the fact that it can fail (e.g. due to > insufficient memory). Basically, this means that even something as trivial > as trying to get the length of a Unicode string can now result in an error. Oh, and the same applies to PyUnicode_AS_UNICODE() now. I doubt that there is *any* code out there that expects this macro to ever return NULL. This means that the current implementation has actually broken the old API. Just allocate an "80% of your memory" long string using the new API and then call PyUnicode_AS_UNICODE() on it to see what I mean. Sadly, a quick look at a couple of recent commits in the pep-393 branch suggested that it is not even always obvious to you as the authors which macros can be called safely and which cannot. I immediately spotted a bug in one of the updated core functions (unicode_repr, IIRC) where PyUnicode_GET_LENGTH() is called without a previous call to PyUnicode_FAST_READY(). 
I find it anything but obvious that calling PyUnicode_DATA() and PyUnicode_KIND() is safe as long as the return value is being checked for errors, but calling PyUnicode_GET_LENGTH() is not safe unless there was a previous call to PyUnicode_Ready(). > I just noticed this when rewriting Cython's helper function that searches a > unicode string for a (Py_UCS4) character. Previously, the entire function > was safe, could never produce an error and therefore always returned a > boolean result. In the new world, the caller of this function must check > and propagate errors. This may not be a major issue in most cases, but it > can have a non-trivial impact on user code, depending on how deep in a call > chain this happens and on how much control the user has over the call chain > (think of a C callback, for example). > > Also, even in the case that there is no error, the potential need to build > up the string on request means that the run time and memory requirements of > an algorithm are less predictable now as they depend on the origin of the > input and not just its Python level string content. > > I would be happier with an implementation that avoided this by always > instantiating the data buffer right from the start, instead of carrying > only a Py_UNICODE buffer for old-style instances. Stefan From guido at python.org Thu Aug 25 23:55:22 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 14:55:22 -0700 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5606C7.9000404@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> Message-ID: On Thu, Aug 25, 2011 at 1:24 AM, "Martin v. L?wis" wrote: >> With this PEP, the unicode object overhead grows to 10 pointer-sized >> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. >> Does it have any adverse effects? > > If I count correctly, it's only three *additional* words (compared to > 3.2): four new ones, minus one that is removed. In addition, it drops > a memory block. Assuming a malloc overhead of two pointers per malloc > block, we get one additional pointer. [...] But strings are allocated via PyObject_Malloc(), i.e. the custom arena-based allocator -- isn't its overhead (for small objects) less than 2 pointers per block? -- --Guido van Rossum (python.org/~guido) From guido at python.org Fri Aug 26 00:29:34 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 15:29:34 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <4E53A87A.1070306@v.loewis.de> <20110823160820.08754ffe@pitrou.net> Message-ID: On Tue, Aug 23, 2011 at 7:41 PM, Torsten Becker wrote: > On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou wrote: >> Macros are useful to shield the abstraction from the implementation. If >> you access the members directly, and the unicode object is represented >> differently in some future version of Python (say e.g. with tagged >> pointers), your code doesn't compile anymore. > > I agree with Antoine, from the experience of porting C code from 3.2 > to the PEP 393 unicode API, the additional encapsulation by macros > made it much easier to change the implementation of what is a field, > what is a field's actual name, and what needs to be calculated through > a function. > > So, I would like to keep primary access as a macro but I see the point > that it would make the struct clearer to access and I would not mind > changing the struct to use a union. ?But then most access currently is
?But then most access currently is > through macros so I am not sure how much benefit the union would bring > as it mostly complicates the struct definition. +1 > Also, common, now simple, checks for "unicode->str == NULL" would look > more ambiguous with a union ("unicode->str.latin1 == NULL"). You could add an extra union field for that: unicode->str.voidptr == NULL -- --Guido van Rossum (python.org/~guido) From guido at python.org Fri Aug 26 00:44:38 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 15:44:38 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Wed, Aug 24, 2011 at 1:22 AM, Stephen J. Turnbull wrote: > Well, no, it gives the right answer according to the design. ?unicode > objects do not contain character strings. ?By design, they contain > code point strings. ?Guido has made that absolutely clear on a number > of occasions. Actually, the situation is that in narrow builds, they contain code units (which may have surrogates); in wide builds they contain code points. I think this is the crux of Tom Christian's complaints about narrow builds. Here's proof that narrow builds contain code units, not code points (i.e. use UTF-16, not UCS-2): $ ./python Python 2.7.2+ (2.7:498b03a55297, Aug 25 2011, 15:14:01) [GCC 4.4.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> sys.maxunicode 65535 >>> a = u'\U00012345' >>> a u'\U00012345' >>> len(a) 2 >>> It's pretty clear that the interpreter is surrogate-aware, which to me indicates the use of UTF-16. Now in the PEP 393 branch: ./python Python 3.3.0a0 (pep-393:c60556059719, Aug 25 2011, 15:31:05) [GCC 4.4.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> sys.maxunicode 1114111 >>> a = '\U00012345' >>> a '?' >>> len(a) 1 >>> And some proof that this branch does not care about surrogates: >>> a = '\ud808' >>> b = '\udf45' >>> a '\ud808' >>> b '\udf45' >>> a + b '\ud808\udf45' >>> len(a+b) 2 >>> However: a = '\ud808\udf45' >>> a '?' >>> len(a) 1 >>> Which to me merely shows it is smart when parsing string literals. (I expect that regular 3.3 narrow builds behave similar to the 2.7 narrow build, and 3.3 wide builds behave similar to the pep-393 build; I didn't have those lying around.) -- --Guido van Rossum (python.org/~guido) From guido at python.org Fri Aug 26 00:54:03 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 15:54:03 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy wrote: > Excuse me for believing the fine 3.2 manual that says > "Strings contain Unicode characters." 
(And to a naive reader, that implies > that string iteration and indexing should produce Unicode characters.) The naive reader also doesn't know the difference between characters, code points and code units. It's the advanced, Unicode-aware reader who is confused by this phrase in the docs. It should say code units; or perhaps code units for narrow builds and code points for wide builds. With PEP 393 we can unconditionally say code points, which is much better. We should try to remove our use of "characters" -- or else we should *define* our use of the term "characters" as "what the Unicode standard calls code points". -- --Guido van Rossum (python.org/~guido) From guido at python.org Fri Aug 26 01:10:02 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 16:10:02 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E5538B7.8010709@haypocalc.com> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <4E5538B7.8010709@haypocalc.com> Message-ID: [Apologies for sending out a long stream of pointed responses, written before I have fully digested this entire mega-thread. I don't have the patience today to collect them all into a single mega-response.] On Wed, Aug 24, 2011 at 10:45 AM, Victor Stinner wrote: > Note: Java and the Qt library use also UTF-16 strings and have exactly the > same "limitations" for str[n] and len(str). Which reminds me. The PEP does not say what other Python implementations besides CPython should do. presumably Jython and IronPython will continue to use UTF-16, so presumably the language reference will still have to document that strings contain code units (not code points) and the objections Tom Christiansen raised against this will remain true for those versions of Python. (I don't know about PyPy, they can presumably decide when they start their Py3k port.) OTOH perhaps IronPython 3.3 and Jython 3.3 can use a similar approach and we can lay the narrow build issues to rest? Can someone here speak for them? -- --Guido van Rossum (python.org/~guido) From guido at python.org Fri Aug 26 01:31:02 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 16:31:02 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E5539D5.60500@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <4E5539D5.60500@v.loewis.de> Message-ID: On Wed, Aug 24, 2011 at 10:50 AM, "Martin v. L?wis" wrote: > Not with these words, though. As I recall, it's rather like (still > with different words) "len() will stay O(1) forever, regardless of > any perceived incorrectness of this choice". And indexing/slicing will also be O(1). > An attempt to change > the builtins to introduce higher complexity for the sake of correctness > is what he rejects. I think PEP 393 balances this well, keeping > the O(1) operations in that complexity, while improving the cross- > platform "correctness" of these functions. 
+1, I am comfortable with the balance struck by the PEP. -- --Guido van Rossum (python.org/~guido) From ezio.melotti at gmail.com Fri Aug 26 02:00:10 2011 From: ezio.melotti at gmail.com (Ezio Melotti) Date: Fri, 26 Aug 2011 03:00:10 +0300 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <4E5538B7.8010709@haypocalc.com> Message-ID: On Wed, Aug 24, 2011 at 11:37 PM, Terry Reedy wrote: > On 8/24/2011 1:45 PM, Victor Stinner wrote: > >> Le 24/08/2011 02:46, Terry Reedy a ?crit : >> > > I don't think that using UTF-16 with surrogate pairs is really a big >> problem. A lot of work has been done to hide this. For example, >> repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters. >> Ezio fixed recently str.is*() methods in Python 3.2+. >> > > I greatly appreciate that he did. The * (lower,upper,title) methods > apparently are not fixed yet as the corresponding new tests are currently > skipped for narrow builds. There are two reasons for this: 1) the str.is* methods get the string and return True/False, so it's enough to iterate on the string, combine the surrogates, and check if the result islower/upper/etc. Methods like lower/upper/etc, afaiu, currently get only a copy of the string, and modify that in place. The current macros advance to the next char during reading and writing, so it's not possible to use them to read/write from/to the same string. We could either change the macros to not advance the pointer [0] (and do it manually in the other functions like is*) or change the function to get the original string too. 2) I'm on vacation. Best Regards, Ezio Melotti [0]: for lower/upper/title it should be possible to modify the string in place, because these operations never converts a non-BMP char to a BMP one (and vice versa), so if two surrogates are read, two surrogates will be written after the transformation. I'm not sure this will work with all the methods though (e.g. str.translate). -------------- next part -------------- An HTML attachment was scrubbed... URL: From dinov at microsoft.com Fri Aug 26 02:01:42 2011 From: dinov at microsoft.com (Dino Viehland) Date: Fri, 26 Aug 2011 00:01:42 +0000 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <4E5538B7.8010709@haypocalc.com> Message-ID: <6C7ABA8B4E309440B857D74348836F2E28F378E2@TK5EX14MBXC292.redmond.corp.microsoft.com> Guido wrote: > Which reminds me. The PEP does not say what other Python > implementations besides CPython should do. presumably Jython and > IronPython will continue to use UTF-16, so presumably the language > reference will still have to document that strings contain code units (not code > points) and the objections Tom Christiansen raised against this will remain > true for those versions of Python. (I don't know about PyPy, they can > presumably decide when they start their Py3k > port.) 
> > OTOH perhaps IronPython 3.3 and Jython 3.3 can use a similar approach and > we can lay the narrow build issues to rest? Can someone here speak for > them? The biggest difficulty for IronPython here would be dealing w/ .NET interop. We can certainly introduce either an IronPython specific string class which is similar to CPython's PyUnicodeObject or we could have multiple distinct .NET types (IronPython.Runtime.AsciiString, System.String, and IronPython.Runtime.Ucs4String) which all appear as the same type to Python. But when Python is calling a .NET API it's always going to return a System.String which is UTF-16. If we had to check and convert all of those strings when they cross into Python it would be very bad for performance. Presumably we could have a 4th type of "interop" string which lazily computes this but if we start wrapping .Net strings we could also get into object identity issues. We could stop using System.String in IronPython all together and say when working w/ .NET strings you get the .NET behavior and when working w/ Python strings you get the Python behavior. I'm not sure how weird and confusing that would be but conversion from an Ipy string to a .NET string could remain cheap if both were UTF-16, and conversions from .NET strings to Ipy strings would only happen if the user did so explicitly. But it's a huge change - it'll almost certainly touch every single source file in IronPython. I would think we'd get 3.2 done first and then think about what to do here. From guido at python.org Fri Aug 26 02:26:53 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 17:26:53 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E55C2C3.3060205@canterbury.ac.nz> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55C2C3.3060205@canterbury.ac.nz> Message-ID: On Wed, Aug 24, 2011 at 8:34 PM, Greg Ewing wrote: > What about things like the surrogateescape codec that > deliberately use code units in non-standard ways? Will > tricks like that still be possible if the code-unit > level is hidden from the programmer? I would think that it should still be possible to explicitly put surrogates into a string, using the appropriate \uxxxx escape or chr(i) or some such approach; the basic string operations IMO shouldn't bother with checking for well-formed character sequences (just as they shouldn't care about normal forms). But decoding bytes from UTF-16 should not leave any surrogate pairs in, since interpreting those is part of the decoding. I'm not sure what should happen with UTF-8 when it (in flagrant violation of the standard, I presume) contains two separately-encoded surrogates forming a valid surrogate pair; probably whatever the UTF-8 codec does on a wide build today should be good enough. Similarly for encoding to UTF-8 on a wide build if one managed to create a string containing a surrogate pair. Basically, I'm for a garbage-in-garbage-out approach (with separate library functions to detect garbage if the app is worried about it). 
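For instance, a rough sketch of the kind of separate detection helper I have in mind (a hypothetical function, not something in the stdlib today; it assumes a build where iterating a str yields code points):

def has_surrogates(s):
    # True if any code point of s lies in the surrogate range
    # U+D800..U+DFFF.  The str type itself never checks this; an app
    # that cares calls a helper like this explicitly.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)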
-- --Guido van Rossum (python.org/~guido) From guido at python.org Fri Aug 26 02:34:44 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 17:34:44 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87liuhakyl.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55FDAC.9010605@v.loewis.de> <87liuhakyl.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Thu, Aug 25, 2011 at 2:39 AM, Stephen J. Turnbull wrote: > If our process is working with an external process (the OS's file > system driver) whose definition includes the statement that "File > names are sequences of Unicode characters", Does any OS actually say that? Don't they usually say "in a specific normal form" or "they're just bytes"? > then C6 says our process > must compare canonically equivalent sequences that it takes to be file > names as the same, whether or not they are in the same normalized > form, or normalized at all, because we can't assume the file system > will treat them as different. ?If we do treat them as different, our > users will get very upset (eg, if we don't signal a duplicate file > name input by the user, and then the OS proceeds to overwrite an > existing file). The solution here is to let the OS do the check, e.g. with os.path.exists() or os.stat(). It would be wrong to write an app that checked for file existence by doing naive lookups in os.listdir() output. -- --Guido van Rossum (python.org/~guido) From guido at python.org Fri Aug 26 02:40:22 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 17:40:22 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87ipplaejd.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <87ty969ljy.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55FDAC.9010605@v.loewis.de> <87liuhakyl.fsf@uwakimon.sk.tsukuba.ac.jp> <4E561CA1.8020500@v.loewis.de> <87ipplaejd.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Thu, Aug 25, 2011 at 4:58 AM, Stephen J. Turnbull wrote: > The problem with your legalistic approach, as I see it, is that if our > definition is looser than the users', all their surprises will be > unpleasant. ?That's not good. I see no alternative to explicitly spelling out what all operations do and let the user figure out whether that meets their needs. E.g. we needn't say that the str type or its == operator conforms to the Unicode standard. We just need to say that the string type is a sequence of code points, that string operations don't do validation or normalization, and that to do a comparison that takes the Unicode std's definition of equivalence (or collation, etc.) into account you must call a certain library method. 
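For example, with the existing unicodedata module (one way such a library call can be spelled today; the session below is sketched by hand rather than pasted from an interpreter):

>>> import unicodedata
>>> a = '\u00e9'          # U+00E9, 'é' as a single code point
>>> b = 'e\u0301'         # 'e' followed by U+0301 COMBINING ACUTE ACCENT
>>> a == b                # == compares code point sequences, nothing more
False
>>> unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)
True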
-- --Guido van Rossum (python.org/~guido) From ezio.melotti at gmail.com Fri Aug 26 03:40:33 2011 From: ezio.melotti at gmail.com (Ezio Melotti) Date: Fri, 26 Aug 2011 04:40:33 +0300 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum wrote: > On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy wrote: > > Excuse me for believing the fine 3.2 manual that says > > "Strings contain Unicode characters." (And to a naive reader, that > implies > > that string iteration and indexing should produce Unicode characters.) > > The naive reader also doesn't know the difference between characters, > code points and code units. It's the advanced, Unicode-aware reader > who is confused by this phrase in the docs. It should say code units; > or perhaps code units for narrow builds and code points for wide > builds. For UTF-16/32 (i.e. narrow/wide), talking about "code units"[0] should be correct. Also note that: * for both, every "code unit" has a specific "codepoint" (including lone surrogates), so it might be OK to talk about "codepoints" too, but * only for wide builds every "codepoints" is represented by a single, 32-bits "code unit". In narrow builds, non-BMP chars are represented by a "code unit sequence" of two elements (i.e. a "surrogate pair"). Since "code unit" refers to the *minimal* bit combination, in UTF-8 characters that needs 2/3/4 bytes, are represented with a "code unit sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and "code points" overlaps only for the ASCII range). > With PEP 393 we can unconditionally say code points, which is > much better. We should try to remove our use of "characters" -- or > else we should *define* our use of the term "characters" as "what the > Unicode standard calls code points". > Character usually works fine, especially for naive readers. Even Unicode-aware readers often confuse between the several terms, so using a simple term and pointing to a more accurate description sounds like a better idea to me. Note that there's also another important term[1]: """ *Unicode Scalar Value*. Any Unicode * code point * except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF16 and E00016 to 10FFFF16 inclusive. """ For example the UTF codecs produce sequences of "code units" (of 8, 16, 32 bits) that represent "scalar values"[2][3]: Chapter 3 [4] says: """ 3.9 Unicode Encoding Forms The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...] D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points. ? As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive. D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. [...] D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence. 
""" On the other hand, Python Unicode strings are not limited to scalar values, because they can also contain lone surrogates. I hope this helps clarify the terminology a bit and doesn't add more confusion, but if we want to use the Unicode terms we should get them right. (Also note that I might have misunderstood something, even if I've been careful with the terms and I double-checked and quoted the relevant parts of the Unicode standard.) Best Regards, Ezio Melotti [0]: From the chapter 3 [4], D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. ? Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units?that is, octets. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. [1]: http://unicode.org/glossary/#unicode_scalar_value [2]: Apparently Python 3 raises an error while encoding lone surrogates in UTF-8, but it doesn't for UTF-16 and UTF-32. >From the chapter 3 [4], D91: "Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed." D92: "Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed." I think this should be fixed. [3]: Note that I'm talking about codecs used to encode/decode Unicode strings to/from bytes here, it's perfectly fine for Python itself to represent lone surrogates in its *internal* representations, regardless of what encoding it's using. [4]: Chapter 3: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ijmorlan at uwaterloo.ca Fri Aug 26 04:28:06 2011 From: ijmorlan at uwaterloo.ca (Isaac Morland) Date: Thu, 25 Aug 2011 22:28:06 -0400 (EDT) Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55C2C3.3060205@canterbury.ac.nz> Message-ID: On Thu, 25 Aug 2011, Guido van Rossum wrote: > I'm not sure what should happen with UTF-8 when it (in flagrant > violation of the standard, I presume) contains two separately-encoded > surrogates forming a valid surrogate pair; probably whatever the UTF-8 > codec does on a wide build today should be good enough. Similarly for > encoding to UTF-8 on a wide build if one managed to create a string > containing a surrogate pair. Basically, I'm for a > garbage-in-garbage-out approach (with separate library functions to > detect garbage if the app is worried about it). If it's called UTF-8, there is no decision to be taken as to decoder behaviour - any byte sequence not permitted by the Unicode standard must result in an error (although, of course, *how* the error is to be reported could legitimately be the subject of endless discussion). There are security implications to violating the standard so this isn't just legalistic purity. 
Hmmm, doesn't look good: Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) [GCC 4.2.1 (Apple Inc. build 5646)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> '\xed\xb0\x80'.decode ('utf-8') u'\udc00' >>> Incorrect! Although this is a narrow build - I can't say what the wide build would do. For reasons of practicality, it may be appropriate to provide easy access to a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be called UTF-8. Other variations may also find use if provided. See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt And CESU-8 technical report: http://www.unicode.org/reports/tr26/ Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist From guido at python.org Fri Aug 26 04:52:09 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 19:52:09 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Thu, Aug 25, 2011 at 6:40 PM, Ezio Melotti wrote: > On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum wrote: >> >> On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy wrote: >> > Excuse me for believing the fine 3.2 manual that says >> > "Strings contain Unicode characters." (And to a naive reader, that >> > implies >> > that string iteration and indexing should produce Unicode characters.) >> >> The naive reader also doesn't know the difference between characters, >> code points and code units. It's the advanced, Unicode-aware reader >> who is confused by this phrase in the docs. It should say code units; >> or perhaps code units for narrow builds and code points for wide >> builds. > > For UTF-16/32 (i.e. narrow/wide), talking about "code units"[0] should be > correct.? Also note that: > ? * for both, every "code unit" has a specific "codepoint" (including lone > surrogates), so it might be OK to talk about "codepoints" too, but > ? * only for wide builds every "codepoints" is represented by a single, > 32-bits "code unit".? In narrow builds, non-BMP chars are represented by a > "code unit sequence" of two elements (i.e. a "surrogate pair"). The more I think about it the more it seems to me that the biggest problem is that in narrow builds it is ambiguous whether (unicode) strings contain code units, i.e. are *encoded* code points, or whether they contain (decoded) code points. In a sense this is repeating the ambiguity of 8-bit strings in Python 2, which are sometimes assumed to contain ASCII or Latin-1 (i.e., code points with a limited range) or UTF-8 (i.e., code units). I know that by now I am repeating myself, but I think it would be really good if we could get rid of this ambiguity. PEP 393 seems the best way forward, even if it doesn't directly address what to do for IronPython or Jython, both of which have to deal with a pervasive native string type that contains UTF-16. IIUC, CPython on Windows will work just fine with PEP 393, even if it means that there is a bit more translation between Python strings and the OS native wchar_t[] type. 
I assume that the data volumes going through the OS APIs is relatively constrained, since data actually written to or read from a file will still be bytes, possibly run through a codec (if it's a text file), and not go through one of the wchar_t[] APIs -- the latter are used for things like filenames, which are much smaller. > Since "code unit" refers to the *minimal* bit combination, in UTF-8 > characters that needs 2/3/4 bytes, are represented with a "code unit > sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and "code > points" overlaps only for the ASCII range). Actually I think UTF-8 is best thought of as an encoding for code points, not characters -- the subtle difference between these two should be of no concern to the UTF-8 codec (unless it is a validating codec). >> With PEP 393 we can unconditionally say code points, which is >> much better. We should try to remove our use of "characters" -- or >> else we should *define* our use of the term "characters" as "what the >> Unicode standard calls code points". > > Character usually works fine, especially for naive readers.? Even > Unicode-aware readers often confuse between the several terms, so using a > simple term and pointing to a more accurate description sounds like a better > idea to me. We may well have no choice -- there is just too much documentation that naively refers to characters while really referring to code units or code points. > Note that there's also another important term[1]: > """ > Unicode Scalar Value. Any Unicode code point except high-surrogate and > low-surrogate code points. In other words, the ranges of integers 0 to > D7FF16 and E00016 to 10FFFF16 inclusive. > """ This seems to involve validation. I think all validation should be sequestered to specific APIs (e.g. certain codecs) and the string type should not care about it. Depending on what they are doing, applications may have to be aware of many subtleties in order to always avoid generating "invalid" (or not well-formed-- what's the difference?) strings. > For example the UTF codecs produce sequences of "code units" (of 8, 16, 32 > bits) that represent "scalar values"[2][3]: > > Chapter 3 [4] says: > """ > 3.9 Unicode Encoding Forms > The Unicode Standard supports three character encoding forms: UTF-32, > UTF-16, and UTF-8. Each encoding form maps the Unicode code points > U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...] I really don't mind whether our codecs actually make exceptions for surrogates (lone or otherwise). The only requirement I care about is that surrogate-free strings round-trip correctly. Again, apps that want to conform to the requirements regarding surrogates can implement their own validation, and certainly at some point we should offer a validation library as part of the stdlib -- but it should be up to the app whether and when to use it. > ?D76 Unicode scalar value: Any Unicode code point except high-surrogate and > low-surrogate code points. > ???? ? As a result of this definition, the set of Unicode scalar values > consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive. > ?D77 Code unit: The minimal bit combination that can represent a unit of > encoded text for processing or interchange. > [...] > ?D79 A Unicode encoding form assigns each Unicode scalar value to a unique > code unit sequence. > """ > > On the other hand, Python Unicode strings are not limited to scalar values, > because they can also contain lone surrogates. Right. 
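A quick Python 3 illustration of that point (a hand-written sketch; the bytes shown follow from how the surrogatepass handler is documented to behave):

>>> s = '\ud800'          # a lone high surrogate is a perfectly legal str element
>>> len(s)
1
>>> s.encode('utf-8')     # the strict UTF-8 codec refuses it
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' ...
>>> s.encode('utf-8', 'surrogatepass')
b'\xed\xa0\x80'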
> I hope this helps clarify the terminology a bit and doesn't add more > confusion, but if we want to use the Unicode terms we should get them > right.? (Also note that I might have misunderstood something, even if I've > been careful with the terms and I double-checked and quoted the relevant > parts of the Unicode standard.) I'm not more confused than I was, but I think we should reduce the number of Unicode terms we care about rather than increase them. If we only ever had to talk about code points and encoded byte sequences I'd be happy -- although in practice we also need to acknowledge the existence of characters that may be represented by multiple code points, since islower(), lower() etc. may need these (and also the re module). Other concepts we may have to at least acknowledge include various normal forms, equivalence, and collation sequences (which are language-dependent?). It would be lovely if someone wrote up an informational PEP so that we don't all have to lug around a copy of the Unicode standard. > Best Regards, > Ezio Melotti > > > [0]: From the chapter 3 [4], > ?D77 Code unit: The minimal bit combination that can represent a unit of > encoded text for processing or interchange. > ?? ? Code units are particular units of computer storage. Other character > encoding standards typically use code units defined as 8-bit units?that is, > octets. > ? ?? The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, > 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the > UTF-32 encoding form. > [1]: http://unicode.org/glossary/#unicode_scalar_value > [2]: Apparently Python 3 raises an error while encoding lone surrogates in > UTF-8, but it doesn't for UTF-16 and UTF-32. > From the chapter 3 [4], > ?D91: "Because surrogate code points are not Unicode scalar values, isolated > UTF-16 code units in the range 0xD800..0xDFFF are ill-formed." > ?D92: "Because surrogate code points are not included in the set of Unicode > scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are > ill-formed." > I think this should be fixed. > [3]: Note that I'm talking about codecs used to encode/decode Unicode > strings to/from bytes here, it's perfectly fine for Python itself to > represent lone surrogates in its *internal* representations, regardless of > what encoding it's using. > [4]: Chapter 3: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf -- --Guido van Rossum (python.org/~guido) From guido at python.org Fri Aug 26 04:59:10 2011 From: guido at python.org (Guido van Rossum) Date: Thu, 25 Aug 2011 19:59:10 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55C2C3.3060205@canterbury.ac.nz> Message-ID: On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland wrote: > On Thu, 25 Aug 2011, Guido van Rossum wrote: > >> I'm not sure what should happen with UTF-8 when it (in flagrant >> violation of the standard, I presume) contains two separately-encoded >> surrogates forming a valid surrogate pair; probably whatever the UTF-8 >> codec does on a wide build today should be good enough. 
Similarly for >> encoding to UTF-8 on a wide build if one managed to create a string >> containing a surrogate pair. Basically, I'm for a >> garbage-in-garbage-out approach (with separate library functions to >> detect garbage if the app is worried about it). > > If it's called UTF-8, there is no decision to be taken as to decoder > behaviour - any byte sequence not permitted by the Unicode standard must > result in an error (although, of course, *how* the error is to be reported > could legitimately be the subject of endless discussion). ?There are > security implications to violating the standard so this isn't just > legalistic purity. You have a point. The security issues cannot be seen separate from all the other issues. The folks inside Google who care about Unicode often harp on this. So I stand corrected. I am fine with codecs treating code points or code point sequences that the Unicode standard doesn't like (e.g. lone surrogates) the same way as more severe errors in the encoded bytes (lots of byte sequences already aren't valid UTF-8). I just hope this doesn't require normal forms or other expensive operations; I hope it's limited to rejecting invalid use of surrogates or other values that are not valid code points (e.g. 0, or >= 2**21). > Hmmm, doesn't look good: > > Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) > [GCC 4.2.1 (Apple Inc. build 5646)] on darwin > Type "help", "copyright", "credits" or "license" for more information. >>>> >>>> '\xed\xb0\x80'.decode ('utf-8') > > u'\udc00' >>>> > > Incorrect! ?Although this is a narrow build - I can't say what the wide > build would do. > > For reasons of practicality, it may be appropriate to provide easy access to > a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be > called UTF-8. ?Other variations may also find use if provided. > > See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt > > And CESU-8 technical report: http://www.unicode.org/reports/tr26/ Thanks for the links! I also like the term "supplemental character" (a code point >= 2**16). And I note that they talk about characters were we've just agreed that we should say code points... -- --Guido van Rossum (python.org/~guido) From andrew.pennebaker at gmail.com Fri Aug 26 06:04:10 2011 From: andrew.pennebaker at gmail.com (Andrew Pennebaker) Date: Fri, 26 Aug 2011 00:04:10 -0400 Subject: [Python-Dev] Windows installers and %PATH% Message-ID: Please have the Windows installers add the Python installation directory to the PATH environment variable. Many newbies dive in without knowing that they must manually add C:\PythonXY to PATH. It's yak shaving, something perfectly automatable that should have been done by the installers way back in Python 1.0. Please also add PYTHONROOT\Scripts. It's where cool things like easy_install.exe are stored. More yak shaving. The only potential downside to this is upsetting users who manage multiple python installations. It's not a problem: they already manually adjust PATH to their liking. Cheers, Andrew Pennebaker www.yellosoft.us -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jxo6948 at rit.edu Fri Aug 26 06:07:20 2011 From: jxo6948 at rit.edu (John O'Connor) Date: Thu, 25 Aug 2011 21:07:20 -0700 Subject: [Python-Dev] Windows installers and %PATH% In-Reply-To: References: Message-ID: + 0 for automatically adding to %PATH% + 1 for providing an option to the user during install - John On Thu, Aug 25, 2011 at 9:04 PM, Andrew Pennebaker < andrew.pennebaker at gmail.com> wrote: > Please have the Windows installers add the Python installation directory to > the PATH environment variable. > > Many newbies dive in without knowing that they must manually add > C:\PythonXY to PATH. It's yak shaving, something perfectly automatable that > should have been done by the installers way back in Python 1.0. > > Please also add PYTHONROOT\Scripts. It's where cool things like > easy_install.exe are stored. More yak shaving. > > The only potential downside to this is upsetting users who manage multiple > python installations. It's not a problem: they already manually adjust PATH > to their liking. > > Cheers, > > Andrew Pennebaker > www.yellosoft.us > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/tehjcon%40gmail.com > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefan_ml at behnel.de Fri Aug 26 06:35:26 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Aug 2011 06:35:26 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55C2C3.3060205@canterbury.ac.nz> Message-ID: Isaac Morland, 26.08.2011 04:28: > On Thu, 25 Aug 2011, Guido van Rossum wrote: >> I'm not sure what should happen with UTF-8 when it (in flagrant >> violation of the standard, I presume) contains two separately-encoded >> surrogates forming a valid surrogate pair; probably whatever the UTF-8 >> codec does on a wide build today should be good enough. Similarly for >> encoding to UTF-8 on a wide build if one managed to create a string >> containing a surrogate pair. Basically, I'm for a >> garbage-in-garbage-out approach (with separate library functions to >> detect garbage if the app is worried about it). > > If it's called UTF-8, there is no decision to be taken as to decoder > behaviour - any byte sequence not permitted by the Unicode standard must > result in an error (although, of course, *how* the error is to be reported > could legitimately be the subject of endless discussion). There are > security implications to violating the standard so this isn't just > legalistic purity. > > Hmmm, doesn't look good: > > Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) > [GCC 4.2.1 (Apple Inc. build 5646)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > >>> '\xed\xb0\x80'.decode ('utf-8') > u'\udc00' > >>> > > Incorrect! Although this is a narrow build - I can't say what the wide > build would do. 
Works the same for me in a wide Py2.7 build, but gives me this in Py3: Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50) [GCC 4.4.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> b'\xed\xb0\x80'.decode ('utf-8') Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: illegal encoding Same for current Py3.3 and the PEP393 build (although both have a better exception message now: "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte"). Stefan From ncoghlan at gmail.com Fri Aug 26 06:52:07 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 26 Aug 2011 14:52:07 +1000 Subject: [Python-Dev] Windows installers and %PATH% In-Reply-To: References: Message-ID: On Fri, Aug 26, 2011 at 2:04 PM, Andrew Pennebaker wrote: > Please have the Windows installers add the Python installation directory to > the PATH environment variable. Please read PEP 397: Python Launcher for Windows. Or at least do us the courtesy of acknowledging that if the issue was as simple as "just munge the PATH", it would have been done long ago. Windows is a developer hostile platform unless you completely buy into the Microsoft toolchain, which is not an option for cross-platform projects like Python. It's well within Microsoft's capabilities to create and support a POSIX compatibility layer that allows applications to look and feel like native ones, but they choose not to, since they see cross-platform development as a competitive threat to their desktop dominance. There's a reason many open source projects don't offer native support at all, instead instructing people to use Cygwin as a compatibility layer. It irks me greatly when people place the blame for this situation on volunteer programmers giving them stuff for free instead of where it belongs (i.e. on the multibillion dollar corporation deliberately failing to implement a widely recognised OS interoperability standard). Regards, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From stefan_ml at behnel.de Fri Aug 26 07:21:11 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Aug 2011 07:21:11 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> Message-ID: Stefan Behnel, 25.08.2011 23:30: > Stefan Behnel, 25.08.2011 20:47: >> "Martin v. L?wis", 24.08.2011 20:15: >>> - issues to be considered (unclarities, bugs, limitations, ...) >> >> A problem of the current implementation is the need for calling >> PyUnicode_(FAST_)READY(), and the fact that it can fail (e.g. due to >> insufficient memory). Basically, this means that even something as trivial >> as trying to get the length of a Unicode string can now result in an error. > > Oh, and the same applies to PyUnicode_AS_UNICODE() now. I doubt that there > is *any* code out there that expects this macro to ever return NULL. This > means that the current implementation has actually broken the old API. Just > allocate an "80% of your memory" long string using the new API and then > call PyUnicode_AS_UNICODE() on it to see what I mean. > > Sadly, a quick look at a couple of recent commits in the pep-393 branch > suggested that it is not even always obvious to you as the authors which > macros can be called safely and which cannot. 
I immediately spotted a bug > in one of the updated core functions (unicode_repr, IIRC) where > PyUnicode_GET_LENGTH() is called without a previous call to > PyUnicode_FAST_READY(). > > I find it everything but obvious that calling PyUnicode_DATA() and > PyUnicode_KIND() is safe as long as the return value is being checked for > errors, but calling PyUnicode_GET_LENGTH() is not safe unless there was a > previous call to PyUnicode_Ready(). And, adding to my own mail yet another time, the current header file states this: """ /* String contains only wstr byte characters. This is only possible when the string was created with a legacy API and PyUnicode_Ready() has not been called yet. Note that PyUnicode_KIND() calls PyUnicode_FAST_READY() so PyUnicode_WCHAR_KIND is only possible as a intialized value not as a result of PyUnicode_KIND(). */ #define PyUnicode_WCHAR_KIND 0 """ From my understanding, this is incorrect. When I call PyUnicode_KIND() on an old style object and it fails to allocate the string buffer, I would expect that I actually get PyUnicode_WCHAR_KIND back as a result, as the SSTATE_KIND_* value in the "state" field has not been initialised yet at that point. Stefan From nir at winpdb.org Fri Aug 26 09:18:15 2011 From: nir at winpdb.org (Nir Aides) Date: Fri, 26 Aug 2011 10:18:15 +0300 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: <1314131362.3485.36.camel@localhost.localdomain> References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> Message-ID: Another face of the discussion is about whether to deprecate the mixing of the threading and processing modules and what to do about the multiprocessing module which is implemented with worker threads. On Tue, Aug 23, 2011 at 11:29 PM, Antoine Pitrou wrote: > Le mardi 23 ao?t 2011 ? 22:07 +0200, Charles-Fran?ois Natali a ?crit : > > 2011/8/23 Antoine Pitrou : > > > Well, I would consider the I/O locks the most glaring problem. Right > > > now, your program can freeze if you happen to do a fork() while e.g. > > > the stderr lock is taken by another thread (which is quite common when > > > debugging). > > > > Indeed. > > To solve this, a similar mechanism could be used: after fork(), in the > > child process: > > - just reset each I/O lock (destroy/re-create the lock) if we can > > guarantee that the file object is in a consistent state (i.e. that all > > the invariants hold). That's the approach I used in my initial patch. > > For I/O locks I think that would work. > There could also be a process-wide "fork lock" to serialize locks and > other operations, if we want 100% guaranteed consistency of I/O objects > across forks. > > > - call a fileobject method which resets the I/O lock and sets the file > > object to a consistent state (in other word, an atfork handler) > > I fear that the complication with atfork handlers is that you have to > manage their lifecycle as well (i.e., when an IO object is destroyed, > you have to unregister the handler). > > Regards > > Antoine. > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/nir%40winpdb.org > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From p.f.moore at gmail.com Fri Aug 26 10:29:27 2011 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 26 Aug 2011 09:29:27 +0100 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 26 August 2011 03:52, Guido van Rossum wrote: > I know that by now I am repeating myself, but I think it would be > really good if we could get rid of this ambiguity. PEP 393 seems the > best way forward, even if it doesn't directly address what to do for > IronPython or Jython, both of which have to deal with a pervasive > native string type that contains UTF-16. Hmm, I'm completely naive in this area, but from reading the thread, would a possible approach be to say that Python (the language definition) is defined in terms of code points (as we already do, even if the wording might benefit from some clarification). Then, under PEP 393, and currently in wide builds, CPython conforms to that definition (and retains the property of basic operations being O(1), which is not in the language definition but is a user expectation and your expressed requirement). IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of code points, not code units, if they are to conform. Presumably this will be easier than moving to a UCS-4 representation, as they can defer to runtime support routines via interop (which presumably get this right - or at the very least can be blamed for any errors :-)) They lose the O(1) guarantee, but that's easily defensible as a tradeoff to conform to underlying runtime semantics. Does this make sense, or have I completely misunderstood things? Paul. PS Thanks to all for the discussion in general, I'm learning a lot about Unicode from all of this! From mal at egenix.com Fri Aug 26 10:54:09 2011 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 26 Aug 2011 10:54:09 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55C2C3.3060205@canterbury.ac.nz> Message-ID: <4E575F31.5010709@egenix.com> Stefan Behnel wrote: > Isaac Morland, 26.08.2011 04:28: >> On Thu, 25 Aug 2011, Guido van Rossum wrote: >>> I'm not sure what should happen with UTF-8 when it (in flagrant >>> violation of the standard, I presume) contains two separately-encoded >>> surrogates forming a valid surrogate pair; probably whatever the UTF-8 >>> codec does on a wide build today should be good enough. Similarly for >>> encoding to UTF-8 on a wide build if one managed to create a string >>> containing a surrogate pair. Basically, I'm for a >>> garbage-in-garbage-out approach (with separate library functions to >>> detect garbage if the app is worried about it). 
>> >> If it's called UTF-8, there is no decision to be taken as to decoder >> behaviour - any byte sequence not permitted by the Unicode standard must >> result in an error (although, of course, *how* the error is to be >> reported >> could legitimately be the subject of endless discussion). There are >> security implications to violating the standard so this isn't just >> legalistic purity. >> >> Hmmm, doesn't look good: >> >> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) >> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin >> Type "help", "copyright", "credits" or "license" for more information. >> >>> '\xed\xb0\x80'.decode ('utf-8') >> u'\udc00' >> >>> >> >> Incorrect! Although this is a narrow build - I can't say what the wide >> build would do. > > Works the same for me in a wide Py2.7 build, but gives me this in Py3: > > Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50) > [GCC 4.4.3] on linux2 > Type "help", "copyright", "credits" or "license" for more information. >>>> b'\xed\xb0\x80'.decode ('utf-8') > Traceback (most recent call last): > File "", line 1, in > UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: > illegal encoding > > Same for current Py3.3 and the PEP393 build (although both have a better > exception message now: "UnicodeDecodeError: 'utf8' codec can't decode > bytes in position 0-1: invalid continuation byte"). The reason for this is that the UTF-8 codec in Python 2.x has never rejected lone surrogates and it was used to store Unicode literals in pyc files (using marshal) and also by pickle for transferring Unicode strings, so we could simply reject lone surrogates, since this would have caused compatibility problems. That change was made in Python 3.x by having a special error handler surrogatepass which allows the UTF-8 codec to process lone surrogates as well. BTW: I'd love to join the discussion about PEP 393, but unfortunately I'm swamped with work, so these are just a few comments... What I'm missing in the discussion is statistics of the effects of the patch (both memory and performance) and the effect on 3rd party extensions. I'm not convinced that the memory/speed tradeoff is worth the breakage or whether the patch actually saves memory in real world applications and I'm unsure whether the needed code changes to the binary Python Unicode API can be done in a minor Python release. Note that in the worst case, a PEP 393 Unicode object will save three versions of the same string, e.g. on Windows with sizeof(wchar_t)==2: A UCS4 version in str, a UTF-8 version in utf8 (this gets build whenever Python needs a UTF-8 version of the Object) and a wchar_t version in wstr (which gets build whenever Python codecs or extensions need Py_UNICODE or a wchar_t representation). On all platforms, in the case where you store a Latin-1 non-ASCII string: str holds the Latin-1 string, utf8 the UTF-8 version and wstr the 2- or 4-bytes wchar_t version. * A note on terminology: Python stores Unicode as code points. A Unicode "code point" refers to any value in the Unicode code range which is 0 - 0x10FFFF. Lone surrogates, unassigned and illegal code points are all still code points - this is a detail people often forget. Various code points in Unicode have special meanings and some are not allowed to be used in encodings, but that does not make them rule them out from being stored and processed as code points. Code units are only used in encoded versions Unicode, e.g. the UTF-8, -16, -32. 
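To make the difference concrete (counts below assume a build where len() counts code points, i.e. a wide or PEP 393 build):

>>> s = '\U00010000'                   # one code point, outside the BMP
>>> len(s)                             # 1 code point
1
>>> len(s.encode('utf-8'))             # 4 UTF-8 code units (bytes)
4
>>> len(s.encode('utf-16-le')) // 2    # 2 UTF-16 code units (a surrogate pair)
2
>>> len(s.encode('utf-32-le')) // 4    # 1 UTF-32 code unit
1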
Mixing code units and code points can cause much confusion, so it's better to talk only about code point when referring to Python Unicode objects, since you only ever meet code units when looking at the the bytes output of the codecs. This is important to know, since Python is not only meant to process Unicode, but also to build Unicode strings, so a careful distinction has to be made when considering what is correct and what not: codecs have to follow much more strict rules than Python itself. * A note on surrogates: These are just one particular problem where you run into the situation where splitting a Unicode string potentially breaks a combination of code points. There are a few other types of code points that cause similar problems, e.g. combining code points. Simply going with UCS-4 does not solve the problem, since even with UCS-4 storage, you can still have surrogates in your Python Unicode string. As with many things, it is important to be aware of the potential problem, but there's no automatic fix to get rid of it. What we can do, is make the best of it and this has happened already in many areas, e.g. codecs joining surrogates automatically, chr() creating surrogates, etc. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 26 2011) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2011-10-04: PyCon DE 2011, Leipzig, Germany 39 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From ezio.melotti at gmail.com Fri Aug 26 11:14:13 2011 From: ezio.melotti at gmail.com (Ezio Melotti) Date: Fri, 26 Aug 2011 12:14:13 +0300 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55C2C3.3060205@canterbury.ac.nz> Message-ID: On Fri, Aug 26, 2011 at 5:59 AM, Guido van Rossum wrote: > On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland > wrote: > > On Thu, 25 Aug 2011, Guido van Rossum wrote: > > > >> I'm not sure what should happen with UTF-8 when it (in flagrant > >> violation of the standard, I presume) contains two separately-encoded > >> surrogates forming a valid surrogate pair; probably whatever the UTF-8 > >> codec does on a wide build today should be good enough. > Surrogates are used and valid only in UTF-16. In UTF-8/32 they are invalid, even if they are in pair (see http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf ). Of course Python can/should be able to represent them internally regardless of the build type. >>Similarly for > >> encoding to UTF-8 on a wide build if one managed to create a string > >> containing a surrogate pair. 
Basically, I'm for a > >> garbage-in-garbage-out approach (with separate library functions to > >> detect garbage if the app is worried about it). > > > > If it's called UTF-8, there is no decision to be taken as to decoder > > behaviour - any byte sequence not permitted by the Unicode standard must > > result in an error (although, of course, *how* the error is to be > reported > > could legitimately be the subject of endless discussion). > What do you mean? We use the "strict" error handler by default and we can specify other handlers already. > There are > > security implications to violating the standard so this isn't just > > legalistic purity. > > You have a point. The security issues cannot be seen separate from all > the other issues. The folks inside Google who care about Unicode often > harp on this. So I stand corrected. I am fine with codecs treating > code points or code point sequences that the Unicode standard doesn't > like (e.g. lone surrogates) the same way as more severe errors in the > encoded bytes (lots of byte sequences already aren't valid UTF-8). Codecs that use the official names should stick to the standards. For example s.encode('utf-32') should either produce a valid utf-32 byte string or raise an error if 's' contains invalid characters (e.g. surrogates). We can have other internal codecs that are based on the UTF-* encodings but allow the representation of lone surrogates and even expose them if we want, but they should have a different name (even 'utf-*-something' should be ok, see http://bugs.python.org/issue12729#msg142053 from "Unicode says you can't put surrogates or noncharacters in a UTF-anything stream."). > I > just hope this doesn't require normal forms or other expensive > operations; I hope it's limited to rejecting invalid use of surrogates > or other values that are not valid code points (e.g. 0, or >= 2**21). > I think there shouldn't be any normalization done automatically by the codecs. > > > Hmmm, doesn't look good: > > > > Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) > > [GCC 4.2.1 (Apple Inc. build 5646)] on darwin > > Type "help", "copyright", "credits" or "license" for more information. > >>>> > >>>> '\xed\xb0\x80'.decode ('utf-8') > > > > u'\udc00' > >>>> > > > > Incorrect! Although this is a narrow build - I can't say what the wide > > build would do. > The UTF-8 codec used to follow RFC 2279 and only recently has been updated to RFC 3629 (see http://bugs.python.org/issue8271#msg107074 ). On Python 2.x it still produces invalid UTF-8 because changing it is backward incompatible. In Python 2 UTF-8 can be used to encode every codepoint from 0 to 10FFFF, and it always works. If we change it now it might start raising errors for an operation that never raised them before (see http://bugs.python.org/issue12729#msg142047 ). Luckily this is fixed in Python 3.x. I think there are more codepoints/byte sequences that should be rejected while encoding/decoding though, in both UTF-8 and UTF-16/32, but I haven't looked at them yet (I would be happy to fix these for 3.3 or even 2.7/3.2 (if applicable), so if you find mismatches with the Unicode standard and report an issue, feel free to assign it to me). Best Regards, Ezio Melotti > > > For reasons of practicality, it may be appropriate to provide easy access > to > > a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not > be > > called UTF-8. Other variations may also find use if provided. 
> > > > See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt > > > > And CESU-8 technical report: http://www.unicode.org/reports/tr26/ > > Thanks for the links! I also like the term "supplemental character" (a > code point >= 2**16). And I note that they talk about characters where > we've just agreed that we should say code points... > > -- > --Guido van Rossum (python.org/~guido ) > _______________________________________________ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From martin at v.loewis.de Fri Aug 26 11:29:55 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 26 Aug 2011 11:29:55 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E576793.2010203@v.loewis.de> > IronPython and Jython can retain UTF-16 as their native form if that > makes interop cleaner, but in doing so they need to ensure that basic > operations like indexing and len work in terms of code points, not > code units, if they are to conform. That means that they won't conform, period. There is no efficient maintainable implementation strategy to achieve that property, and it may well take years until somebody provides an efficient unmaintainable implementation. > Does this make sense, or have I completely misunderstood things? You seem to assume it is ok for Jython/IronPython to provide indexing in O(n). It is not. However, non-conformance may not be that much of an issue. They do not conform in many other aspects, either (such as not supporting Python 3, for example, or not supporting the C API), so they may well choose to ignore such a minor requirement if there was one. For BMP strings, they conform fine, and it may well be that Jython either doesn't have non-BMP strings, or doesn't care whether len() or indexing of its non-BMP strings is "correct". Regards, Martin From stefan_ml at behnel.de Fri Aug 26 12:29:56 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Aug 2011 12:29:56 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E576793.2010203@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> Message-ID: "Martin v. Löwis", 26.08.2011 11:29: > You seem to assume it is ok for Jython/IronPython to provide indexing in > O(n). It is not. I think we can leave this discussion aside. Jython and IronPython have their own platform specific constraints to which they need to adapt their implementation. For a Jython user, it means a lot to be able to efficiently pass strings (and other data) back and forth between Jython and other JVM code, and it's not hard to guess that the same is true for IronPython/.NET users. After all, the platform integration is the very *reason* for most users to select one of these implementations. 
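The conformance question is whether len() and indexing count code points or UTF-16 code units; the difference is easy to show at the Python level. This is only a sketch of the counting involved, not of anything Jython or IronPython actually implement:

    # One character outside the BMP: U+1D11E MUSICAL SYMBOL G CLEF.
    s = 'G\U0001D11E'

    # Code points, as counted on a wide (UCS-4) or PEP-393 CPython build
    # (a narrow 3.2 build already reports 3 here):
    print(len(s))                            # 2

    # UTF-16 code units, which is what Java's String.length() and .NET's
    # String.Length report for the equivalent native string:
    print(len(s.encode('utf-16-le')) // 2)   # 3 -- the clef needs a surrogate pair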
Besides, what if these implementations provided indexing in, say, O(log N) instead of O(1) or O(N), e.g. by building a tree index into each string? You could have an index that simply marks runs of surrogate pairs and BMP substrings, thus providing a likely-to-be-somewhat-compact index. That index would obviously have to be built, but so do the different string representations in post-PEP-393 CPython, especially on Windows, as I have learned. Would such a less severe violation of the strict O(1) rule still be "not ok"? I think this is not such a clear black-and-white issue. Both implementations have notably different performance characteristics than CPython in some more or less important areas, as does PyPy. At some point, the language compliance label has to account for that. Stefan From martin at v.loewis.de Fri Aug 26 12:29:29 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 26 Aug 2011 12:29:29 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> Message-ID: <4E577589.4030809@v.loewis.de> > But strings are allocated via PyObject_Malloc(), i.e. the custom > arena-based allocator -- isn't its overhead (for small objects) less > than 2 pointers per block? Ah, right, I missed that. Indeed, those have no header, and the only overhead is the padding to a multiple of 8. That shifts the picture; I hope the table below is correct, assuming ASCII strings. 3.2: 7 pointers (adds 4 bytes padding on 32-bit systems) 393: 10 pointers string | 32-bit pointer | 32-bit pointer | 64-bit pointer size | 16-bit wchar_t | 32-bit wchar_t | 32-bit wchar_t | 3.2 | 393 | 3.2 | 393 | 3.2 | 393 | ----------------------------------------------------------- 1 | 40 | 48 | 40 | 48 | 64 | 88 | 2 | 40 | 48 | 48 | 48 | 72 | 88 | 3 | 40 | 48 | 48 | 48 | 72 | 88 | 4 | 48 | 48 | 56 | 48 | 80 | 88 | 5 | 48 | 48 | 56 | 48 | 80 | 88 | 6 | 48 | 48 | 64 | 48 | 88 | 88 | 7 | 48 | 48 | 64 | 48 | 88 | 88 | 8 | 56 | 56 | 72 | 56 | 96 | 86 | So 1-byte strings increase in size; very short strings increase on 16-bit-wchar_t systems and 64-bit systems. Short strings keep there size, and long strings save. Regards, Martin From solipsis at pitrou.net Fri Aug 26 12:51:30 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 26 Aug 2011 12:51:30 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <4E5538B7.8010709@haypocalc.com> <6C7ABA8B4E309440B857D74348836F2E28F378E2@TK5EX14MBXC292.redmond.corp.microsoft.com> Message-ID: <20110826125130.591b142b@pitrou.net> Why would PEP 393 apply to other implementations than CPython? Regards Antoine. On Fri, 26 Aug 2011 00:01:42 +0000 Dino Viehland wrote: > Guido wrote: > > Which reminds me. The PEP does not say what other Python > > implementations besides CPython should do. presumably Jython and > > IronPython will continue to use UTF-16, so presumably the language > > reference will still have to document that strings contain code units (not code > > points) and the objections Tom Christiansen raised against this will remain > > true for those versions of Python. 
(I don't know about PyPy, they can > > presumably decide when they start their Py3k > > port.) > > > > OTOH perhaps IronPython 3.3 and Jython 3.3 can use a similar approach and > > we can lay the narrow build issues to rest? Can someone here speak for > > them? > > The biggest difficulty for IronPython here would be dealing w/ .NET interop. > We can certainly introduce either an IronPython specific string class which > is similar to CPython's PyUnicodeObject or we could have multiple distinct > .NET types (IronPython.Runtime.AsciiString, System.String, and > IronPython.Runtime.Ucs4String) which all appear as the same type to Python. > > But when Python is calling a .NET API it's always going to return a System.String > which is UTF-16. If we had to check and convert all of those strings when they > cross into Python it would be very bad for performance. Presumably we could > have a 4th type of "interop" string which lazily computes this but if we start > wrapping .Net strings we could also get into object identity issues. > > We could stop using System.String in IronPython all together and say when > working w/ .NET strings you get the .NET behavior and when working w/ Python > strings you get the Python behavior. I'm not sure how weird and confusing that > would be but conversion from an Ipy string to a .NET string could remain cheap if > both were UTF-16, and conversions from .NET strings to Ipy strings would only > happen if the user did so explicitly. > > But it's a huge change - it'll almost certainly touch every single source file in > IronPython. I would think we'd get 3.2 done first and then think about what to > do here. > From stefan_ml at behnel.de Fri Aug 26 13:08:54 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Aug 2011 13:08:54 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <20110826125130.591b142b@pitrou.net> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <4E5538B7.8010709@haypocalc.com> <6C7ABA8B4E309440B857D74348836F2E28F378E2@TK5EX14MBXC292.redmond.corp.microsoft.com> <20110826125130.591b142b@pitrou.net> Message-ID: Antoine Pitrou, 26.08.2011 12:51: > Why would PEP 393 apply to other implementations than CPython? Not the PEP itself, just the implications of the result. The question was whether the language specification in a post PEP-393 can (and if so, should) be changed into requiring unicode objects to be defined based on code points. Narrow builds, as well as Jython and IronPython, currently deviate from this as they use UTF-16 as their native string encoding, which, for one, prevents O(1) indexing into characters as well as a direct match between length and character count (minus combining characters etc.). I think this discussion can safely be considered off-topic for this thread (which isn't exactly short enough to keep adding more topics to it). 
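As to the "minus combining characters" aside: even with full code point semantics, len() does not count user-perceived characters. A small illustration using only the stdlib unicodedata module:

    import unicodedata

    s = 'e\u0301'       # 'e' followed by U+0301 COMBINING ACUTE ACCENT
    print(len(s))                                    # 2 code points
    print(len(unicodedata.normalize('NFC', s)))      # 1 after composing to U+00E9
    print(sum(1 for ch in s if not unicodedata.combining(ch)))   # 1 base character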
Stefan From solipsis at pitrou.net Fri Aug 26 13:14:33 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 26 Aug 2011 13:14:33 +0200 Subject: [Python-Dev] Windows installers and %PATH% References: Message-ID: <20110826131433.7ab3680d@pitrou.net> On Fri, 26 Aug 2011 14:52:07 +1000 Nick Coghlan wrote: > Windows is a developer hostile platform unless you completely buy into > the Microsoft toolchain, which is not an option for cross-platform > projects like Python. We already buy into the MS toolchain since we require Visual Studio (or at least the command-line tools for building, but I suppose anyone doing serious development on Windows would use the GUI). We also maintain the project files by hand instead of using e.g. cmake. > It's well within Microsoft's capabilities to create and support a > POSIX compatibility layer that allows applications to look and feel > like native ones I have a hard time imagining how a POSIX compatibility layer would make Windows apps feel more "native". It's a matter of fact that Unix and Windows systems function differently. I don't know how much of it can be completely hidden. > the multibillion dollar corporation deliberately > failing to implement a widely recognised OS interoperability > standard I wouldn't call POSIX an OS interoperability standard, but an Unix interoperability standard. It exists because there is so much fragmentation in the Unix world. I doubt MS was invited to the party when POSIX specifications were designed. Windows has its own standards, but since MS is basically the sole OS vendor, they are free to dictate them :) And when I look at the various "POSIX" systems we try to support there: http://www.python.org/dev/buildbot/all/waterfall?category=3.x.stable&category=3.x.unstable I have the feeling that perhaps we spend more time trying to work around incompatibilities, special cases and various levels of (in)compliance among POSIX systems, than implementing the Windows-specific code paths of low-level functions (where the APIs are usually well-defined and very stable). Regards Antoine. From brian.curtin at gmail.com Fri Aug 26 15:40:38 2011 From: brian.curtin at gmail.com (Brian Curtin) Date: Fri, 26 Aug 2011 08:40:38 -0500 Subject: [Python-Dev] Windows installers and %PATH% In-Reply-To: References: Message-ID: On Thu, Aug 25, 2011 at 23:04, Andrew Pennebaker < andrew.pennebaker at gmail.com> wrote: > Please have the Windows installers add the Python installation directory to > the PATH environment variable. The http://bugs.python.org bug tracker is a better place for feature requests like this, of which there have been several over the years. This has become a hotter topic lately with several discussions around the community, and a PEP to provide some similar functionality. I've talked with several educators/trainers around and the lack of a Path installation is the #1 thing that bites their newcomers, and it's an issue that bites them before they've even begun to learn. Many newbies dive in without knowing that they must manually add C:\PythonXY > to PATH. It's yak shaving, something perfectly automatable that should have > been done by the installers way back in Python 1.0. > > Please also add PYTHONROOT\Scripts. It's where cool things like > easy_install.exe are stored. More yak shaving. > A clean installation of Python includes no Scripts directory, so I'm not sure we should be polluting the Path with yet-to-exist directories. 
An approach could be to have packaging optionally add the scripts directory on the installation of a third-party package. The only potential downside to this is upsetting users who manage multiple > python installations. It's not a problem: they already manually adjust PATH > to their liking. > "Users who manage multiple python installations" is probably a very, very large number, so we have quite the audience to appease, and it actually is a problem. We should not go halfway on this feature and say "if it doesn't work perfectly, you're back to being on your own". I think the likely case is that any path addition feature will read the path, then offer to replace existing instances or append to the end. I haven't yet done any work on this, but my todo list for 3.3 includes adding some path related features to the installer. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jnoller at gmail.com Fri Aug 26 16:05:09 2011 From: jnoller at gmail.com (Jesse Noller) Date: Fri, 26 Aug 2011 10:05:09 -0400 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> Message-ID: On Fri, Aug 26, 2011 at 3:18 AM, Nir Aides wrote: > Another face of the discussion is about whether to deprecate the mixing of > the threading and processing modules and what to do about the > multiprocessing module which is implemented with worker threads. There's a bug open - http://bugs.python.org/issue8713 which would offer non windows users the ability to avoid using fork() entirely, which would sidestep the problem outlined in the atfork() bug. Under windows, which has no fork() mechanism, we create a subprocess and then use pipes for intercommunication: nothing is inherited from the parent process except the state passed into the child. I think that "deprecating" the use of threads w/ multiprocessing - or at least crippling it is the wrong answer. Multiprocessing needs the helper threads it uses internally to manage queues, etc. Removing that ability would require a near-total rewrite, which is just a non-starter. I'd rather examine bug 8713 more closely, and offer this option for all users in 3.x and document the existing issues outlined in http://bugs.python.org/issue6721 for 2.x - the proposals in that bug are IMHO, out of bounds for a 2.x release. In essence; the issue here is multiprocessing's use of fork on unix without the following exec - which is what the windows implementation essentially does using subprocess. Adding the option to *not* fork changes the fundamental behavior on unix systems - but I fundamentally feel that it's a saner, and more consistent behavior for the module as a whole. So, I'd ask that we not talk about tearing out the ability to use MP and threads, or threads with MP - that would be crippling, and there's existing code in the wild (including multiprocessing itself) that uses this mix without issue - it's stripping out functionality for what is a surprising and painful edge case that rarely directly affects users. I would focus on the atfork() patch more directly, ignoring multiprocessing in the discussion, and focusing on the merits of gps' initial proposal and patch. 
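For anyone who has not read issue 6721, the failure mode is easy to reproduce. A minimal Unix-only sketch (illustrative only, not taken from the issue; it needs Python 3.2 for the acquire() timeout): a lock held by a sibling thread at fork() time is copied into the child in the locked state, and nothing in the child can ever release it. The timeout is there only so the example terminates instead of deadlocking:

    import os
    import threading
    import time

    lock = threading.Lock()

    def holder():
        with lock:                 # hold the lock across the fork
            time.sleep(2)

    t = threading.Thread(target=holder)
    t.start()
    time.sleep(0.5)                # let the thread acquire the lock first

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, but the copied lock
        # is still marked as held by a thread that was never duplicated.
        print('child acquired lock:', lock.acquire(timeout=1))   # False
        os._exit(0)
    os.waitpid(pid, 0)
    t.join()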
jesse From guido at python.org Fri Aug 26 16:55:39 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 07:55:39 -0700 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E577589.4030809@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <4E577589.4030809@v.loewis.de> Message-ID: It would be nice if someone wrote a test to roughly verify these numbers, e.v. by allocating lots of strings of a certain size and measuring the process size before and after (being careful to adjust for the list or other data structure required to keep those objects alive). --Guido On Fri, Aug 26, 2011 at 3:29 AM, "Martin v. L?wis" wrote: >> But strings are allocated via PyObject_Malloc(), i.e. the custom >> arena-based allocator -- isn't its overhead (for small objects) less >> than 2 pointers per block? > > Ah, right, I missed that. Indeed, those have no header, and the only > overhead is the padding to a multiple of 8. > > That shifts the picture; I hope the table below is correct, > assuming ASCII strings. > 3.2: 7 pointers (adds 4 bytes padding on 32-bit systems) > 393: 10 pointers > > string | 32-bit pointer | 32-bit pointer | 64-bit pointer > size ? | 16-bit wchar_t | 32-bit wchar_t | 32-bit wchar_t > ? ? ? | 3.2 ? ? | ?393 | 3.2 ? ?| ?393 ?| 3.2 ? ?| ?393 ?| > ----------------------------------------------------------- > 1 ? ? ?| 40 ? ? ?| 48 ? | 40 ? ? | ?48 ? | 64 ? ? | 88 ? ?| > 2 ? ? ?| 40 ? ? ?| 48 ? | 48 ? ? | ?48 ? | 72 ? ? | 88 ? ?| > 3 ? ? ?| 40 ? ? ?| 48 ? | 48 ? ? | ?48 ? | 72 ? ? | 88 ? ?| > 4 ? ? ?| 48 ? ? ?| 48 ? | 56 ? ? | ?48 ? | 80 ? ? | 88 ? ?| > 5 ? ? ?| 48 ? ? ?| 48 ? | 56 ? ? | ?48 ? | 80 ? ? | 88 ? ?| > 6 ? ? ?| 48 ? ? ?| 48 ? | 64 ? ? | ?48 ? | 88 ? ? | 88 ? ?| > 7 ? ? ?| 48 ? ? ?| 48 ? | 64 ? ? | ?48 ? | 88 ? ? | 88 ? ?| > 8 ? ? ?| 56 ? ? ?| 56 ? | 72 ? ? | ?56 ? | 96 ? ? | 86 ? ?| > > So 1-byte strings increase in size; very short strings increase > on 16-bit-wchar_t systems and 64-bit systems. Short strings > keep there size, and long strings save. > > Regards, > Martin > > > -- --Guido van Rossum (python.org/~guido) From guido at python.org Fri Aug 26 16:56:05 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 07:56:05 -0700 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <4E577589.4030809@v.loewis.de> Message-ID: Also, please add the table (and the reasoning that led to it) to the PEP. On Fri, Aug 26, 2011 at 7:55 AM, Guido van Rossum wrote: > It would be nice if someone wrote a test to roughly verify these > numbers, e.v. by allocating lots of strings of a certain size and > measuring the process size before and after (being careful to adjust > for the list or other data structure required to keep those objects > alive). > > --Guido > > On Fri, Aug 26, 2011 at 3:29 AM, "Martin v. L?wis" wrote: >>> But strings are allocated via PyObject_Malloc(), i.e. the custom >>> arena-based allocator -- isn't its overhead (for small objects) less >>> than 2 pointers per block? >> >> Ah, right, I missed that. Indeed, those have no header, and the only >> overhead is the padding to a multiple of 8. >> >> That shifts the picture; I hope the table below is correct, >> assuming ASCII strings. >> 3.2: 7 pointers (adds 4 bytes padding on 32-bit systems) >> 393: 10 pointers >> >> string | 32-bit pointer | 32-bit pointer | 64-bit pointer >> size ? 
| 16-bit wchar_t | 32-bit wchar_t | 32-bit wchar_t >> ? ? ? | 3.2 ? ? | ?393 | 3.2 ? ?| ?393 ?| 3.2 ? ?| ?393 ?| >> ----------------------------------------------------------- >> 1 ? ? ?| 40 ? ? ?| 48 ? | 40 ? ? | ?48 ? | 64 ? ? | 88 ? ?| >> 2 ? ? ?| 40 ? ? ?| 48 ? | 48 ? ? | ?48 ? | 72 ? ? | 88 ? ?| >> 3 ? ? ?| 40 ? ? ?| 48 ? | 48 ? ? | ?48 ? | 72 ? ? | 88 ? ?| >> 4 ? ? ?| 48 ? ? ?| 48 ? | 56 ? ? | ?48 ? | 80 ? ? | 88 ? ?| >> 5 ? ? ?| 48 ? ? ?| 48 ? | 56 ? ? | ?48 ? | 80 ? ? | 88 ? ?| >> 6 ? ? ?| 48 ? ? ?| 48 ? | 64 ? ? | ?48 ? | 88 ? ? | 88 ? ?| >> 7 ? ? ?| 48 ? ? ?| 48 ? | 64 ? ? | ?48 ? | 88 ? ? | 88 ? ?| >> 8 ? ? ?| 56 ? ? ?| 56 ? | 72 ? ? | ?56 ? | 96 ? ? | 86 ? ?| >> >> So 1-byte strings increase in size; very short strings increase >> on 16-bit-wchar_t systems and 64-bit systems. Short strings >> keep there size, and long strings save. >> >> Regards, >> Martin >> >> >> > > > > -- > --Guido van Rossum (python.org/~guido) > -- --Guido van Rossum (python.org/~guido) From stefan_ml at behnel.de Fri Aug 26 17:55:05 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Aug 2011 17:55:05 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> Message-ID: Stefan Behnel, 25.08.2011 23:30: > Sadly, a quick look at a couple of recent commits in the pep-393 branch > suggested that it is not even always obvious to you as the authors which > macros can be called safely and which cannot. I immediately spotted a bug > in one of the updated core functions (unicode_repr, IIRC) where > PyUnicode_GET_LENGTH() is called without a previous call to > PyUnicode_FAST_READY(). Here is another example from unicodeobject.c, commit 56aaa17fc05e: + switch(PyUnicode_KIND(string)) { + case PyUnicode_1BYTE_KIND: + list = ucs1lib_splitlines( + (PyObject*) string, PyUnicode_1BYTE_DATA(string), + PyUnicode_GET_LENGTH(string), keepends); + break; + case PyUnicode_2BYTE_KIND: + list = ucs2lib_splitlines( + (PyObject*) string, PyUnicode_2BYTE_DATA(string), + PyUnicode_GET_LENGTH(string), keepends); + break; + case PyUnicode_4BYTE_KIND: + list = ucs4lib_splitlines( + (PyObject*) string, PyUnicode_4BYTE_DATA(string), + PyUnicode_GET_LENGTH(string), keepends); + break; + default: + assert(0); + list = 0; + } The assert(0) at the end will hit when the system is running out of memory while working on a wchar string. Stefan From solipsis at pitrou.net Fri Aug 26 17:53:36 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 26 Aug 2011 17:53:36 +0200 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> Message-ID: <20110826175336.3af6be57@pitrou.net> Hi, > I think that "deprecating" the use of threads w/ multiprocessing - or > at least crippling it is the wrong answer. Multiprocessing needs the > helper threads it uses internally to manage queues, etc. Removing that > ability would require a near-total rewrite, which is just a > non-starter. I agree that this wouldn't actually benefit anyone. Besides, I don't think it's even possible to avoid threads in multiprocessing, given the various constraints. We would have to force the user to run their main thread in an event loop, and that would be twisted (tm). > I would focus on the atfork() patch more directly, ignoring > multiprocessing in the discussion, and focusing on the merits of gps' > initial proposal and patch. 
I think this could also be combined with Charles-Fran?ois' patch. Regards Antoine. From status at bugs.python.org Fri Aug 26 18:07:20 2011 From: status at bugs.python.org (Python tracker) Date: Fri, 26 Aug 2011 18:07:20 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20110826160720.D4C731CA8A@psf.upfronthosting.co.za> ACTIVITY SUMMARY (2011-08-19 - 2011-08-26) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue. Do NOT respond to this message. Issues counts and deltas: open 2963 (+26) closed 21665 (+35) total 24628 (+61) Open issues with patches: 1288 Issues opened (44) ================== #12326: Linux 3: code should avoid using sys.platform == 'linux2' http://bugs.python.org/issue12326 reopened by georg.brandl #12788: test_email fails with -R http://bugs.python.org/issue12788 opened by pitrou #12790: doctest.testmod does not run tests in functools.partial functi http://bugs.python.org/issue12790 opened by stevenjd #12793: allow filters in os.walk http://bugs.python.org/issue12793 opened by Jacek.Pliszka #12795: Remove the major version from sys.platform http://bugs.python.org/issue12795 opened by haypo #12797: io.FileIO and io.open should support openat http://bugs.python.org/issue12797 opened by pitrou #12798: Update mimetypes documentation http://bugs.python.org/issue12798 opened by sandro.tosi #12800: 'tarfile.StreamError: seeking backwards is not allowed' when e http://bugs.python.org/issue12800 opened by adunand #12801: C realpath not used by os.path.realpath http://bugs.python.org/issue12801 opened by pitrou #12802: Windows error code 267 should be mapped to ENOTDIR, not EINVAL http://bugs.python.org/issue12802 opened by pitrou #12805: Optimizations for bytes.join() et. al http://bugs.python.org/issue12805 opened by jcon #12806: argparse: Hybrid help text formatter http://bugs.python.org/issue12806 opened by GraylinKim #12807: Optimizations for {bytearray,bytes,unicode}.strip() http://bugs.python.org/issue12807 opened by jcon #12808: Coverage of codecs.py http://bugs.python.org/issue12808 opened by tleeuwenburg #12809: Missing new setsockopts in Linux (eg: IP_TRANSPARENT) http://bugs.python.org/issue12809 opened by micolous #12812: libffi does not build with clang on amd64 http://bugs.python.org/issue12812 opened by shenki #12813: uuid4 is not tested if a uuid4 system routine isn't present http://bugs.python.org/issue12813 opened by anacrolix #12814: Possible intermittent bug in test_array http://bugs.python.org/issue12814 opened by ncoghlan #12815: Coverage of smtpd.py http://bugs.python.org/issue12815 opened by tleeuwenburg #12816: smtpd uses library outside of the standard libraries http://bugs.python.org/issue12816 opened by tleeuwenburg #12817: test_multiprocessing: io.BytesIO() requires bytearray buffers http://bugs.python.org/issue12817 opened by skrah #12818: email.utils.formataddr incorrectly quotes parens inside quoted http://bugs.python.org/issue12818 opened by r.david.murray #12819: PEP 393 - Flexible Unicode String Representation http://bugs.python.org/issue12819 opened by torsten.becker #12820: Tests for Lib/xml/dom/minicompat.py http://bugs.python.org/issue12820 opened by John.Chandler #12822: NewGIL should use CLOCK_MONOTONIC if possible. http://bugs.python.org/issue12822 opened by naoki #12823: Broken link in "SSL wrapper for socket objects" document http://bugs.python.org/issue12823 opened by iworm #12825: Missing and incorrect link to a command line option. 
http://bugs.python.org/issue12825 opened by Kyle.Simpson #12828: xml.dom.minicompat is not documented http://bugs.python.org/issue12828 opened by sandro.tosi #12829: pyexpat segmentation fault caused by multiple calls to Parse() http://bugs.python.org/issue12829 opened by dhgutteridge #12830: --install-data doesn't effect resources destination http://bugs.python.org/issue12830 opened by trevor #12832: The documentation for the print function should explain/point http://bugs.python.org/issue12832 opened by r.david.murray #12833: raw_input misbehaves when readline is imported http://bugs.python.org/issue12833 opened by idank #12834: memoryview.tobytes() incorrect for non-contiguous arrays http://bugs.python.org/issue12834 opened by skrah #12835: Missing SSLSocket.sendmsg() wrapper allows programs to send un http://bugs.python.org/issue12835 opened by baikie #12836: cast() creates circular reference in original object http://bugs.python.org/issue12836 opened by bgilbert #12837: Patch for issue #12810 removed a valid check on socket ancilla http://bugs.python.org/issue12837 opened by baikie #12839: zlibmodule cannot handle Z_VERSION_ERROR zlib error http://bugs.python.org/issue12839 opened by rmtew #12840: "maintainer" value clear the "author" value when register http://bugs.python.org/issue12840 opened by keul #12841: Incorrect tarfile.py extraction http://bugs.python.org/issue12841 opened by seblu #12842: Docs: first parameter of tp_richcompare() always has the corre http://bugs.python.org/issue12842 opened by skrah #12843: file object read* methods in append mode overflows http://bugs.python.org/issue12843 opened by Otacon.Karurosu #12844: Support more than 255 arguments http://bugs.python.org/issue12844 opened by andersk #12845: PEP-3118: C-contiguity with zero strides http://bugs.python.org/issue12845 opened by skrah #12846: unicodedata.normalize turkish letter problem http://bugs.python.org/issue12846 opened by fizymania Most recent 15 issues with no replies (15) ========================================== #12845: PEP-3118: C-contiguity with zero strides http://bugs.python.org/issue12845 #12842: Docs: first parameter of tp_richcompare() always has the corre http://bugs.python.org/issue12842 #12836: cast() creates circular reference in original object http://bugs.python.org/issue12836 #12815: Coverage of smtpd.py http://bugs.python.org/issue12815 #12814: Possible intermittent bug in test_array http://bugs.python.org/issue12814 #12813: uuid4 is not tested if a uuid4 system routine isn't present http://bugs.python.org/issue12813 #12812: libffi does not build with clang on amd64 http://bugs.python.org/issue12812 #12809: Missing new setsockopts in Linux (eg: IP_TRANSPARENT) http://bugs.python.org/issue12809 #12805: Optimizations for bytes.join() et. 
al http://bugs.python.org/issue12805 #12800: 'tarfile.StreamError: seeking backwards is not allowed' when e http://bugs.python.org/issue12800 #12790: doctest.testmod does not run tests in functools.partial functi http://bugs.python.org/issue12790 #12788: test_email fails with -R http://bugs.python.org/issue12788 #12771: 2to3 -d adds extra whitespace http://bugs.python.org/issue12771 #12742: Add support for CESU-8 encoding http://bugs.python.org/issue12742 #12739: read stuck with multithreading and simultaneous subprocess.Pop http://bugs.python.org/issue12739 Most recent 15 issues waiting for review (15) ============================================= #12842: Docs: first parameter of tp_richcompare() always has the corre http://bugs.python.org/issue12842 #12841: Incorrect tarfile.py extraction http://bugs.python.org/issue12841 #12839: zlibmodule cannot handle Z_VERSION_ERROR zlib error http://bugs.python.org/issue12839 #12837: Patch for issue #12810 removed a valid check on socket ancilla http://bugs.python.org/issue12837 #12835: Missing SSLSocket.sendmsg() wrapper allows programs to send un http://bugs.python.org/issue12835 #12832: The documentation for the print function should explain/point http://bugs.python.org/issue12832 #12822: NewGIL should use CLOCK_MONOTONIC if possible. http://bugs.python.org/issue12822 #12820: Tests for Lib/xml/dom/minicompat.py http://bugs.python.org/issue12820 #12819: PEP 393 - Flexible Unicode String Representation http://bugs.python.org/issue12819 #12818: email.utils.formataddr incorrectly quotes parens inside quoted http://bugs.python.org/issue12818 #12817: test_multiprocessing: io.BytesIO() requires bytearray buffers http://bugs.python.org/issue12817 #12816: smtpd uses library outside of the standard libraries http://bugs.python.org/issue12816 #12815: Coverage of smtpd.py http://bugs.python.org/issue12815 #12813: uuid4 is not tested if a uuid4 system routine isn't present http://bugs.python.org/issue12813 #12809: Missing new setsockopts in Linux (eg: IP_TRANSPARENT) http://bugs.python.org/issue12809 Top 10 most discussed issues (10) ================================= #12326: Linux 3: code should avoid using sys.platform == 'linux2' http://bugs.python.org/issue12326 30 msgs #12678: test_packaging and test_distutils failures under Windows http://bugs.python.org/issue12678 27 msgs #12713: argparse: allow abbreviation of sub commands by users http://bugs.python.org/issue12713 13 msgs #12795: Remove the major version from sys.platform http://bugs.python.org/issue12795 12 msgs #5231: Change format of a memoryview http://bugs.python.org/issue5231 9 msgs #11564: pickle not 64-bit ready http://bugs.python.org/issue11564 9 msgs #12801: C realpath not used by os.path.realpath http://bugs.python.org/issue12801 9 msgs #12808: Coverage of codecs.py http://bugs.python.org/issue12808 8 msgs #5113: 2.5.4.3 / test_posix failing on HPUX systems http://bugs.python.org/issue5113 7 msgs #12760: Add create mode to open() http://bugs.python.org/issue12760 7 msgs Issues closed (34) ================== #4106: multiprocessing occasionally spits out exception during shutdo http://bugs.python.org/issue4106 closed by pitrou #5301: add mimetype for image/vnd.microsoft.icon (patch) http://bugs.python.org/issue5301 closed by sandro.tosi #6484: No unit test for mailcap module http://bugs.python.org/issue6484 closed by python-dev #6560: socket sendmsg(), recvmsg() methods http://bugs.python.org/issue6560 closed by python-dev #9200: Make the str.is* methods work with non-BMP chars on narrow bui 
http://bugs.python.org/issue9200 closed by ezio.melotti #11657: multiprocessing_{send,recv}fd fail with fds > 256 http://bugs.python.org/issue11657 closed by pitrou #12191: Add shutil.chown to allow to use user and group name (and not http://bugs.python.org/issue12191 closed by sandro.tosi #12213: BufferedRandom: issues with interlaced read-write http://bugs.python.org/issue12213 closed by pitrou #12461: it's not clear how the shutil.copystat() should work on symlin http://bugs.python.org/issue12461 closed by eric.araujo #12656: test.test_asyncore: add tests for AF_INET6 and AF_UNIX sockets http://bugs.python.org/issue12656 closed by neologix #12682: Meaning of 'accepted' resolution as documented in devguide http://bugs.python.org/issue12682 closed by ezio.melotti #12745: Python2 or Python3 page http://bugs.python.org/issue12745 closed by terry.reedy #12772: fractional day attribute in datetime class http://bugs.python.org/issue12772 closed by belopolsky #12775: immense performance problems related to the garbage collector http://bugs.python.org/issue12775 closed by terry.reedy #12778: JSON-serializing a large container takes too much memory http://bugs.python.org/issue12778 closed by pitrou #12783: test_posix failure on FreeBSD 6.4: test_get_and_set_scheduler_ http://bugs.python.org/issue12783 closed by neologix #12786: subprocess wait() hangs when stdin is closed http://bugs.python.org/issue12786 closed by neologix #12787: xmlrpc.client documentation (MultiCall Objects) points to a br http://bugs.python.org/issue12787 closed by sandro.tosi #12789: re.Scanner doesn't support more than 2 groups on regex http://bugs.python.org/issue12789 closed by angelonuffer #12791: reference cycle with exception state not broken by generator.c http://bugs.python.org/issue12791 closed by pitrou #12792: Document the "type" field of the tracker in the devguide http://bugs.python.org/issue12792 closed by ezio.melotti #12794: platform: add a function to get the system version as tuple http://bugs.python.org/issue12794 closed by haypo #12796: total_ordering goes into infinite recursion when NotImplemente http://bugs.python.org/issue12796 closed by ncoghlan #12799: realpath not resolving symbolic links under Windows http://bugs.python.org/issue12799 closed by haypo #12803: SSLContext.load_cert_chain() should accept a password argument http://bugs.python.org/issue12803 closed by pitrou #12804: make test should not enable the urlfetch resource http://bugs.python.org/issue12804 closed by nadeem.vawda #12810: Remove check for negative unsigned value in socketmodule.c http://bugs.python.org/issue12810 closed by neologix #12811: Tabnanny doesn't close its tokenize files properly http://bugs.python.org/issue12811 closed by ncoghlan #12821: test_fcntl failed on OpenBSD 5.x http://bugs.python.org/issue12821 closed by neologix #12824: Make the write_file() helper function in test_shutil return th http://bugs.python.org/issue12824 closed by hynek #12826: module _socket failed to build on OpenBSD http://bugs.python.org/issue12826 closed by python-dev #12827: OS-specific location in Lib/tempfile.py for OpenBSD http://bugs.python.org/issue12827 closed by neologix #12831: 2to3 and integer division http://bugs.python.org/issue12831 closed by mark.dickinson #12838: FAQ/Programming typo: range[3] is used http://bugs.python.org/issue12838 closed by python-dev From brett at python.org Fri Aug 26 18:35:12 2011 From: brett at python.org (Brett Cannon) Date: Fri, 26 Aug 2011 09:35:12 -0700 Subject: [Python-Dev] Planned PEP status 
changes In-Reply-To: References: Message-ID: On Tue, Aug 23, 2011 at 19:42, Nick Coghlan wrote: > Unless I hear any objections, I plan to adjust the current PEP > statuses as follows some time this weekend: > > Move from Accepted to Finished: > > ? ?389 ?argparse - New Command Line Parsing Module ? ? ? ? ? ? ?Bethard > ? ?391 ?Dictionary-Based Configuration For Logging ? ? ? ? ? ? ?Sajip > ? ?3108 ?Standard Library Reorganization ? ? ? ? ? ? ? ? ? ? ? ? Cannon I had always hoped to get profile/cProfile taken care of, but obviously that just didn't ever happen. So no objection, just a slight sting from the reminder of why the PEP was left open. -Brett > ? ?3135 ?New Super > Spealman, Delaney, Ryan > > Move from Accepted to Withdrawn (with a reference to Reid Kleckner's blog post) > ? ?3146 ?Merging Unladen Swallow into CPython > Winter, Yasskin, Kleckner > > > The PEP 3118 enhanced buffer protocol has some ongoing semantic and > implementation issues still to be worked out, so I plan to leave that > at Accepted. Ditto for PEP 3121 (extension module finalisation), since > that doesn't play nicely with the current 'set everything to None' > approach to breaking cycles during module finalisation. > > The other Accepted PEPs are either packaging standards related or > genuinely not implemented yet. > > Cheers, > Nick. > > -- > Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/brett%40python.org > From guido at python.org Fri Aug 26 18:51:00 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 09:51:00 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E576793.2010203@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> Message-ID: On Fri, Aug 26, 2011 at 2:29 AM, "Martin v. L?wis" wrote: >> IronPython and Jython can retain UTF-16 as their native form if that >> makes interop cleaner, but in doing so they need to ensure that basic >> operations like indexing and len work in terms of code points, not >> code units, if they are to conform. > > That means that they won't conform, period. There is no efficient > maintainable implementation strategy to achieve that property, and > it may take well years until somebody provides an efficient > unmaintainable implementation. > >> Does this make sense, or have I completely misunderstood things? > > You seem to assume it is ok for Jython/IronPython to provide indexing in > O(n). It is not. Indeed. > However, non-conformance may not be that much of an issue. They do not > conform in many other aspects, either (such as not supporting Python 3, > for example, or not supporting the C API) that they may well chose to > ignore such a minor requirement if there was one. For BMP strings, > they conform fine, and it may well be that Jython eithers either don't > have non-BMP strings, or don't care whether len() or indexing of their > non-BMP strings is "correct". I think this is fine. 
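What "correct" means here is observable from pure Python; a quick sketch of how a non-BMP string differs between a narrow (UTF-16) 3.2 build and a wide (UCS-4) or PEP-393 build:

    s = '\U00010000'      # first code point outside the BMP

    # Wide / PEP-393 build:  len(s) == 1 and s[0] == '\U00010000'
    # Narrow 3.2 build:      len(s) == 2 and s[0] == '\ud800' (a lone surrogate)
    print(len(s), ascii(s[0]))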
I had been hoping that all Python implementations claiming compatibility with version 3.3 of the language reference would be free of worries about surrogates, but it simply doesn't make sense. And yes, I'm well aware that PEP 393 is only for CPython. It's just that I had hoped that it would get rid of some of Tom C's specific complaints for all Python implementations; but it really seems impossible to do so. One consequence may be that the standard library, to the extent it is shared by other implementations, may still have to worry about surrogates and other issues inherent in narrow builds or other 16-bit-based string types. We'll cross that bridge when we get to it. -- --Guido van Rossum (python.org/~guido) From martin at v.loewis.de Fri Aug 26 18:56:07 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 26 Aug 2011 18:56:07 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> Message-ID: <4E57D027.10204@v.loewis.de> Am 26.08.2011 17:55, schrieb Stefan Behnel: > Stefan Behnel, 25.08.2011 23:30: >> Sadly, a quick look at a couple of recent commits in the pep-393 branch >> suggested that it is not even always obvious to you as the authors which >> macros can be called safely and which cannot. I immediately spotted a bug >> in one of the updated core functions (unicode_repr, IIRC) where >> PyUnicode_GET_LENGTH() is called without a previous call to >> PyUnicode_FAST_READY(). > > Here is another example from unicodeobject.c, commit 56aaa17fc05e: > > + switch(PyUnicode_KIND(string)) { > + case PyUnicode_1BYTE_KIND: > + list = ucs1lib_splitlines( > + (PyObject*) string, PyUnicode_1BYTE_DATA(string), > + PyUnicode_GET_LENGTH(string), keepends); > + break; > + case PyUnicode_2BYTE_KIND: > + list = ucs2lib_splitlines( > + (PyObject*) string, PyUnicode_2BYTE_DATA(string), > + PyUnicode_GET_LENGTH(string), keepends); > + break; > + case PyUnicode_4BYTE_KIND: > + list = ucs4lib_splitlines( > + (PyObject*) string, PyUnicode_4BYTE_DATA(string), > + PyUnicode_GET_LENGTH(string), keepends); > + break; > + default: > + assert(0); > + list = 0; > + } > > The assert(0) at the end will hit when the system is running out of > memory while working on a wchar string. No, that should not happen: it should never get to this point. I agree with your observation that somebody should be done about error handling, and will update the PEP shortly. I propose that PyUnicode_Ready should be explicitly called on input where raising an exception is feasible. In contexts where it is not feasible (such as reading a character, or reading the length or the kind), failing to ready the string should cause a fatal error. What do you think? Regards, Martin From guido at python.org Fri Aug 26 19:02:46 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 10:02:46 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> Message-ID: On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel wrote: > "Martin v. L?wis", 26.08.2011 11:29: >> >> You seem to assume it is ok for Jython/IronPython to provide indexing in >> O(n). It is not. 
> > I think we can leave this discussion aside. (And yet, you keep arguing. :-) > Jython and IronPython have their > own platform specific constraints to which they need to adapt their > implementation. For a Jython user, it means a lot to be able to efficiently > pass strings (and other data) back and forth between Jython and other JVM > code, and it's not hard to guess that the same is true for IronPython/.NET > users. After all, the platform integration is the very *reason* for most > users to select one of these implementations. Right. > Besides, what if these implementations provided indexing in, say, O(log N) > instead of O(1) or O(N), e.g. by building a tree index into each string? You > could have an index that simply marks runs of surrogate pairs and BMP > substrings, thus providing a likely-to-be-somewhat-compact index. That index > would obviously have to be built, but so do the different string > representations in post-PEP-393 CPython, especially on Windows, as I have > learned. Eek. No, please. Those platforms' native string types have length and slicing operations that are O(1) and work in terms of 16-bit code points. Python should use those. It would be awful if Java and Python code doing the same manipulations on the same string would come to different conclusions because Python tried to paper over surrogates. I dug up some evidence for Java, at least: http://download.oracle.com/javase/1,5.0/docs/api/java/lang/CharSequence.html#length%28%29 """ length int length() Returns the length of this character sequence. The length is the number of 16-bit chars in the sequence. Returns: the number of chars in this sequence """ This is quite explicit about counting 16-bit code units. I've found similar info about .NET, which defines "char" as a 16-bit quantity and string length in terms of the number of "char" items. > Would such a less severe violation of the strict O(1) rule still be "not > ok"? I think this is not such a clear black-and-white issue. Both > implementations have notably different performance characteristics than > CPython in some more or less important areas, as does PyPy. At some point, > the language compliance label has to account for that. Since you had to ask, I have to declare that, indeed, non-O(1) behavior would not be okay for those platforms. All in all, I don't think we should legislate Python strings to be able to support 21-bit code points using O(1) indexing. PEP 393 makes this possible for CPython, and it's been said that PyPy can follow suit. But it'll be a "quality-of-implementation" issue, not built into the language spec. -- --Guido van Rossum (python.org/~guido) From p.f.moore at gmail.com Fri Aug 26 19:13:42 2011 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 26 Aug 2011 18:13:42 +0100 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> Message-ID: On 26 August 2011 17:51, Guido van Rossum wrote: > On Fri, Aug 26, 2011 at 2:29 AM, "Martin v. L?wis" wrote: (Regarding my comments on code point semantics) >> You seem to assume it is ok for Jython/IronPython to provide indexing in >> O(n). It is not. > > Indeed. 
On 26 August 2011 18:02, Guido van Rossum wrote: > Eek. No, please. Those platforms' native string types have length and > slicing operations that are O(1) and work in terms of 16-bit code > points. Python should use those. It would be awful if Java and Python > code doing the same manipulations on the same string would come to > different conclusions because Python tried to paper over surrogates. *That* is actually the erroneous assumption I had made - that the Java and .NET native string type had code point semantics (i.e., took surrogates into account). As that isn't the case, my comments aren't valid - and I agree that having common semantics (and hence exposing surrogates) is too important to lose. On the other hand, that pretty much establishes that whatever PEP 393 achieves in terms of allowing all builds of CPython to offer code point semantics, the language definition can't mandate it. Thanks for the clarification. Paul. From andrew.pennebaker at gmail.com Fri Aug 26 19:18:55 2011 From: andrew.pennebaker at gmail.com (Andrew Pennebaker) Date: Fri, 26 Aug 2011 13:18:55 -0400 Subject: [Python-Dev] Windows installers and %PATH% In-Reply-To: References: Message-ID: I see that the Ruby 1.9 stable Windows installer has a checkbox to add the Ruby binaries to PATH. That would be excellent for Python. Also, there's no need to "buy in" to the Windows toolchain just to edit PATH. Installer software includes functionality for editing environment variables, and in any case Python has built in environment variable editing, even for Windows. Cheers, Andrew Pennebaker www.yellosoft.us On Fri, Aug 26, 2011 at 9:40 AM, Brian Curtin wrote: > On Thu, Aug 25, 2011 at 23:04, Andrew Pennebaker < > andrew.pennebaker at gmail.com> wrote: > >> Please have the Windows installers add the Python installation directory >> to the PATH environment variable. > > > The http://bugs.python.org bug tracker is a better place for feature > requests like this, of which there have been several over the years. This > has become a hotter topic lately with several discussions around the > community, and a PEP to provide some similar functionality. I've talked with > several educators/trainers around and the lack of a Path installation is the > #1 thing that bites their newcomers, and it's an issue that bites them > before they've even begun to learn. > > Many newbies dive in without knowing that they must manually add >> C:\PythonXY to PATH. It's yak shaving, something perfectly automatable that >> should have been done by the installers way back in Python 1.0. >> >> Please also add PYTHONROOT\Scripts. It's where cool things like >> easy_install.exe are stored. More yak shaving. >> > > A clean installation of Python includes no Scripts directory, so I'm not > sure we should be polluting the Path with yet-to-exist directories. An > approach could be to have packaging optionally add the scripts directory on > the installation of a third-party package. > > The only potential downside to this is upsetting users who manage multiple >> python installations. It's not a problem: they already manually adjust PATH >> to their liking. >> > > "Users who manage multiple python installations" is probably a very, very > large number, so we have quite the audience to appease, and it actually is a > problem. We should not go halfway on this feature and say "if it doesn't > work perfectly, you're back to being on your own". 
I think the likely case > is that any path addition feature will read the path, then offer to replace > existing instances or append to the end. > > I haven't yet done any work on this, but my todo list for 3.3 includes > adding some path related features to the installer. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrew.pennebaker at gmail.com Fri Aug 26 19:21:18 2011 From: andrew.pennebaker at gmail.com (Andrew Pennebaker) Date: Fri, 26 Aug 2011 13:21:18 -0400 Subject: [Python-Dev] Windows installers and %PATH% In-Reply-To: References: Message-ID: I mentioned PYTHONROOT\Script because of the distribute package, which adds PYTHONROOT\Script\easy_install.exe. My mistake if \Script is created by distribute and not Python. Then my beef is with distribute for not adding its binaries to PATH--how else would I use easy_setup if not in a terminal? Cheers, Andrew Pennebaker www.yellosoft.us On Fri, Aug 26, 2011 at 9:40 AM, Brian Curtin wrote: > On Thu, Aug 25, 2011 at 23:04, Andrew Pennebaker < > andrew.pennebaker at gmail.com> wrote: > >> Please have the Windows installers add the Python installation directory >> to the PATH environment variable. > > > The http://bugs.python.org bug tracker is a better place for feature > requests like this, of which there have been several over the years. This > has become a hotter topic lately with several discussions around the > community, and a PEP to provide some similar functionality. I've talked with > several educators/trainers around and the lack of a Path installation is the > #1 thing that bites their newcomers, and it's an issue that bites them > before they've even begun to learn. > > Many newbies dive in without knowing that they must manually add >> C:\PythonXY to PATH. It's yak shaving, something perfectly automatable that >> should have been done by the installers way back in Python 1.0. >> >> Please also add PYTHONROOT\Scripts. It's where cool things like >> easy_install.exe are stored. More yak shaving. >> > > A clean installation of Python includes no Scripts directory, so I'm not > sure we should be polluting the Path with yet-to-exist directories. An > approach could be to have packaging optionally add the scripts directory on > the installation of a third-party package. > > The only potential downside to this is upsetting users who manage multiple >> python installations. It's not a problem: they already manually adjust PATH >> to their liking. >> > > "Users who manage multiple python installations" is probably a very, very > large number, so we have quite the audience to appease, and it actually is a > problem. We should not go halfway on this feature and say "if it doesn't > work perfectly, you're back to being on your own". I think the likely case > is that any path addition feature will read the path, then offer to replace > existing instances or append to the end. > > I haven't yet done any work on this, but my todo list for 3.3 includes > adding some path related features to the installer. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From guido at python.org Fri Aug 26 19:26:38 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 10:26:38 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> Message-ID: On Fri, Aug 26, 2011 at 10:13 AM, Paul Moore wrote: > On 26 August 2011 18:02, Guido van Rossum wrote: > >> Eek. No, please. Those platforms' native string types have length and >> slicing operations that are O(1) and work in terms of 16-bit code >> points. Python should use those. It would be awful if Java and Python >> code doing the same manipulations on the same string would come to >> different conclusions because Python tried to paper over surrogates. > > *That* is actually the erroneous assumption I had made - that the Java > and .NET native string type had code point semantics (i.e., took > surrogates into account). As that isn't the case, my comments aren't > valid - and I agree that having common semantics (and hence exposing > surrogates) is too important to lose. Those platforms probably *also* have libraries of operations to support writing apps that conform to the Unicode standard. But those apps will have to be aware of the difference between the "naive" length of a string and the number of code points of characters in it. > On the other hand, that pretty much establishes that whatever PEP 393 > achieves in terms of allowing all builds of CPython to offer code > point semantics, the language definition can't mandate it. The most severe consequence to me seems that the stdlib (which is reused by those other platforms) cannot assume CPython's ideal world -- even if specific apps sometimes can. -- --Guido van Rossum (python.org/~guido) From brian.curtin at gmail.com Fri Aug 26 19:30:49 2011 From: brian.curtin at gmail.com (Brian Curtin) Date: Fri, 26 Aug 2011 12:30:49 -0500 Subject: [Python-Dev] Windows installers and %PATH% In-Reply-To: References: Message-ID: On Fri, Aug 26, 2011 at 12:18, Andrew Pennebaker < andrew.pennebaker at gmail.com> wrote: > Also, there's no need to "buy in" to the Windows toolchain just to edit > PATH. Installer software includes functionality for editing environment > variables, and in any case Python has built in environment variable editing, > even for Windows. > The built-in environment variable support, e.g., os.getenv/putenv/environ, isn't helpful here as it does not modify the global environment. It modifies the current process and usually subprocesses. The proper way to apply environment variable changes to the entire system is via the registry and broadcasting a setting change message. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stefan_ml at behnel.de Fri Aug 26 20:14:17 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Aug 2011 20:14:17 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> Message-ID: Guido van Rossum, 26.08.2011 19:02: > On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel wrote: >> Besides, what if these implementations provided indexing in, say, O(log N) >> instead of O(1) or O(N), e.g. by building a tree index into each string? You >> could have an index that simply marks runs of surrogate pairs and BMP >> substrings, thus providing a likely-to-be-somewhat-compact index. That index >> would obviously have to be built, but so do the different string >> representations in post-PEP-393 CPython, especially on Windows, as I have >> learned. > > Eek. No, please. I was mostly just confabulating. My main point was that this isn't a black-and-white thing - O(1) xor O(N) - and thus is orthogonal to the PEP. You can achieve compliant/acceptable behaviour at the code point level, the performance guarantees level or the platform integration level - choose any two. CPython is just lucky that there isn't really a platform integration level to take into account (if we leave the Windows environment aside for a moment). > Those platforms' native string types have length and > slicing operations that are O(1) and work in terms of 16-bit code > points. Python should use those. It would be awful if Java and Python > code doing the same manipulations on the same string would come to > different conclusions because Python tried to paper over surrogates. I fully agree. >> Would such a less severe violation of the strict O(1) rule still be "not >> ok"? I think this is not such a clear black-and-white issue. Both >> implementations have notably different performance characteristics than >> CPython in some more or less important areas, as does PyPy. At some point, >> the language compliance label has to account for that. > > Since you had to ask, I have to declare that, indeed, non-O(1) > behavior would not be okay for those platforms. I take it that you say that because you want strings to perform in the 'normal' platform specific way here (i.e. like Java/.NET strings), and not so much because you want to require the exact same (performance) characteristics across Python implementations. So your choice is platform integration over code points, leaving the identical performance as a side-effect of the platform integration. > All in all, I don't think we should legislate Python strings to be > able to support 21-bit code points using O(1) indexing. PEP 393 makes > this possible for CPython, and it's been said that PyPy can follow > suit. But it'll be a "quality-of-implementation" issue, not built into > the language spec. Makes sense to me. Most likely, Unicode heavy Python code will have to take platform specifics into account anyway, so there are limits as to what is suitable for a language spec. 
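To make the auxiliary-index idea sketched earlier in this message concrete, here is a minimal illustration (not code from any of the implementations discussed): instead of a tree it records the code point positions of surrogate pairs in a sorted list, so indexing costs O(log k) where k is the number of non-BMP characters, and effectively O(1) when there are none.

    from bisect import bisect_left

    class Utf16String:
        """Wrap a list of UTF-16 code units and index by code point."""

        def __init__(self, units):
            self.units = units
            self.astral = []      # code point positions of surrogate pairs
            cp = unit = 0
            while unit < len(units):
                if 0xD800 <= units[unit] < 0xDC00:   # lead surrogate
                    self.astral.append(cp)
                    unit += 2
                else:
                    unit += 1
                cp += 1
            self.length = cp

        def __len__(self):
            return self.length    # code points, not code units

        def __getitem__(self, index):
            # Each surrogate pair before 'index' shifts the code unit
            # offset by one extra unit; bisect counts them in O(log k).
            i = index + bisect_left(self.astral, index)
            unit = self.units[i]
            if 0xD800 <= unit < 0xDC00:
                low = self.units[i + 1]
                return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            return chr(unit)

A real implementation would also need slicing, negative indices and iteration, which is where the maintenance cost mentioned elsewhere in the thread comes in.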
Stefan From stefan_ml at behnel.de Fri Aug 26 20:28:43 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Aug 2011 20:28:43 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E57D027.10204@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <4E57D027.10204@v.loewis.de> Message-ID: "Martin v. L?wis", 26.08.2011 18:56: > I agree with your observation that somebody should be done about error > handling, and will update the PEP shortly. I propose that > PyUnicode_Ready should be explicitly called on input where raising an > exception is feasible. In contexts where it is not feasible (such > as reading a character, or reading the length or the kind), failing to > ready the string should cause a fatal error. I consider this an increase in complexity. It will then no longer be enough to access the data, the user will first have to figure out a suitable place in the code to make sure it's actually there, potentially forgetting about it because it works in all test cases, or potentially triggering a huge amount of overhead that copies and 'recodes' the string data by executing one of the macros that does it automatically. For the specific case of Cython, I would guess that I could just add another special case that reads the data from the Py_UNICODE buffer and combines surrogates at need, but that will only work in some cases (specifically not for indexing). And outside of Cython, most normal user code won't do that. My gut feeling leans towards a KISS approach. If you go the route to require an explicit point for triggering PyUnicode_Ready() calls, why not just go all the way and make it completely explicit in *all* cases? I.e. remove all implicit calls from the macros and make it part of the new API semantics that users *must* call PyUnicode_FAST_READY() before doing anything with a new string data layout. Much fewer surprises. Note that there isn't currently an official macro way to figure out that the flexible string layout has not been initialised yet, i.e. that wstr is set but str is not. If the implicit PyUnicode_Ready() calls get removed, PyUnicode_KIND() could take that place by simply returning WSTR_KIND. That being said, the main problem I currently see is that basically all existing code needs to be updated in order to handle these errors. Otherwise, it would be possible to trigger crashes by properly forging a string and passing it into an unprepared C library to let it run into a NULL pointer return value of PyUnicode_AS_UNICODE(). Stefan From stefan_ml at behnel.de Fri Aug 26 21:58:52 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Aug 2011 21:58:52 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> <4E57D027.10204@v.loewis.de> Message-ID: Stefan Behnel, 26.08.2011 20:28: > "Martin v. L?wis", 26.08.2011 18:56: >> I agree with your observation that somebody should be done about error >> handling, and will update the PEP shortly. I propose that >> PyUnicode_Ready should be explicitly called on input where raising an >> exception is feasible. In contexts where it is not feasible (such >> as reading a character, or reading the length or the kind), failing to >> ready the string should cause a fatal error. >[...] > My gut feeling leans towards a KISS approach. If you go the route to > require an explicit point for triggering PyUnicode_Ready() calls, why not > just go all the way and make it completely explicit in *all* cases? I.e. 
> remove all implicit calls from the macros and make it part of the new API > semantics that users *must* call PyUnicode_FAST_READY() before doing > anything with a new string data layout. Much fewer surprises. > > Note that there isn't currently an official macro way to figure out that > the flexible string layout has not been initialised yet, i.e. that wstr is > set but str is not. If the implicit PyUnicode_Ready() calls get removed, > PyUnicode_KIND() could take that place by simply returning WSTR_KIND. Here's a patch that updates only the header file, to make it clear what I mean. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: simplified-pep-393-api.patch Type: text/x-patch Size: 4637 bytes Desc: not available URL: From guido at python.org Fri Aug 26 23:45:17 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 14:45:17 -0700 Subject: [Python-Dev] Should we move to replace re with regex? Message-ID: I just made a pass of all the Unicode-related bugs filed by Tom Christiansen, and found that in several, the response was "this is fixed in the regex module [by Matthew Barnett]". I started replying that I thought that we should fix the bugs in the re module (i.e., really in _sre.c) but on second thought I wonder if maybe regex is mature enough to replace re in Python 3.3. It would mean that we won't fix any of these bugs in earlier Python versions, but I could live with that. However, I don't know much about regex -- how compatible is it, how fast is it (including extreme cases where the backtracking goes crazy), how bug-free is it, and so on. Plus, how much work would it be to actually incorporate it into CPython as a complete drop-in replacement of the re package (such that nobody needs to change their imports or the flags they pass to the re module). We'd also probably have to train some core developers to be familiar enough with the code to maintain and evolve it -- I assume we can't just volunteer Matthew to do so forever... :-) What's the alternative? Is adding the requested bug fixes and new features to _sre.c really that hard? -- --Guido van Rossum (python.org/~guido) From victor.stinner at haypocalc.com Fri Aug 26 23:37:42 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Fri, 26 Aug 2011 23:37:42 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <6C7ABA8B4E309440B857D74348836F2E28F378E2@TK5EX14MBXC292.redmond.corp.microsoft.com> References: <6C7ABA8B4E309440B857D74348836F2E28F378E2@TK5EX14MBXC292.redmond.corp.microsoft.com> Message-ID: <201108262337.42349.victor.stinner@haypocalc.com> Le vendredi 26 ao?t 2011 02:01:42, Dino Viehland a ?crit : > The biggest difficulty for IronPython here would be dealing w/ .NET > interop. We can certainly introduce either an IronPython specific string > class which is similar to CPython's PyUnicodeObject or we could have > multiple distinct .NET types (IronPython.Runtime.AsciiString, > System.String, and > IronPython.Runtime.Ucs4String) which all appear as the same type to Python. > > But when Python is calling a .NET API it's always going to return a > System.String which is UTF-16. If we had to check and convert all of > those strings when they cross into Python it would be very bad for > performance. Presumably we could have a 4th type of "interop" string > which lazily computes this but if we start wrapping .Net strings we could > also get into object identity issues. 
Python 3 encodes all Unicode strings to the OS encoding (and the result is decoded) for all syscalls and calls to libraries: to the locale encoding on UNIX, to UTF-16 on Windows. Currently, Py_UNICODE is wchar_t which is 16 bits. So Py_UNICODE* is already a UTF-16 string. I don't know if the overhead of the PEP 393 (encode to UTF-16 on Windows) for these calls is important or not. But on UNIX, pure ASCII string don't have to be encoded anymore if the locale encoding is UTF-8 or ASCII. IronPython can wait to see how CPython+PEP 383 handles these problems, and how slower it is. > But it's a huge change - it'll almost certainly touch every single source > file in IronPython. With the PEP 393, it's transparent: the PyUnicode_AS_UNICODE encodes the string to UTF-16 (allocate memory, etc.). Except that applications should now check if an error occurred (check for NULL). > I would think we'd get 3.2 done first and then think > about what to do here. I don't think that IronPython needs to support non-BMP characters without using surrogates. Bug reports about non-BMP characters usually don't have use cases, but just want to make Python perfect. There is no need to hurry. PEP 393 tries to reduce the memory footprint. The effect on non-BMP character is just a *nice* border effect. Or was the PEP design to solve narrow build issues? Victor From guido at python.org Sat Aug 27 00:00:07 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 15:00:07 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <201108262337.42349.victor.stinner@haypocalc.com> References: <6C7ABA8B4E309440B857D74348836F2E28F378E2@TK5EX14MBXC292.redmond.corp.microsoft.com> <201108262337.42349.victor.stinner@haypocalc.com> Message-ID: I have a different question about IronPython and Jython now. Do their regular expression libraries support Unicode better than CPython's? E.g. does "." match a surrogate pair? Tom C suggests that Java's regex libraries get this and many other details right despite Java's use of UTF-16 to represent strings. So hopefully Jython's re library is built on top of Java's? PS. Is there a better contact for Jython? -- --Guido van Rossum (python.org/~guido) From mal at egenix.com Sat Aug 27 00:09:10 2011 From: mal at egenix.com (M.-A. Lemburg) Date: Sat, 27 Aug 2011 00:09:10 +0200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: Message-ID: <4E581986.3000709@egenix.com> Guido van Rossum wrote: > I just made a pass of all the Unicode-related bugs filed by Tom > Christiansen, and found that in several, the response was "this is > fixed in the regex module [by Matthew Barnett]". I started replying > that I thought that we should fix the bugs in the re module (i.e., > really in _sre.c) but on second thought I wonder if maybe regex is > mature enough to replace re in Python 3.3. It would mean that we won't > fix any of these bugs in earlier Python versions, but I could live > with that. > > However, I don't know much about regex -- how compatible is it, how > fast is it (including extreme cases where the backtracking goes > crazy), how bug-free is it, and so on. Plus, how much work would it be > to actually incorporate it into CPython as a complete drop-in > replacement of the re package (such that nobody needs to change their > imports or the flags they pass to the re module). 
> > We'd also probably have to train some core developers to be familiar > enough with the code to maintain and evolve it -- I assume we can't > just volunteer Matthew to do so forever... :-) > > What's the alternative? Is adding the requested bug fixes and new > features to _sre.c really that hard? Why not simply add the new lib, see whether it works out and then decide which path to follow. We've done that with the old regex lib. It took a few years and releases to have people port their applications to the then new re module and syntax, but in the end it worked. With a new regex library there are likely going to be quite a few subtle differences between re and regex - even if it's just doing things in a more Unicode compatible way. I don't think anyone can actually list all the differences given the complex nature of regular expressions, so people will likely need a few years and releases to get used it before a switch can be made. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 27 2011) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2011-10-04: PyCon DE 2011, Leipzig, Germany 38 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From guido at python.org Sat Aug 27 00:18:35 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 15:18:35 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <4E581986.3000709@egenix.com> References: <4E581986.3000709@egenix.com> Message-ID: On Fri, Aug 26, 2011 at 3:09 PM, M.-A. Lemburg wrote: > Guido van Rossum wrote: >> I just made a pass of all the Unicode-related bugs filed by Tom >> Christiansen, and found that in several, the response was "this is >> fixed in the regex module [by Matthew Barnett]". I started replying >> that I thought that we should fix the bugs in the re module (i.e., >> really in _sre.c) but on second thought I wonder if maybe regex is >> mature enough to replace re in Python 3.3. It would mean that we won't >> fix any of these bugs in earlier Python versions, but I could live >> with that. >> >> However, I don't know much about regex -- how compatible is it, how >> fast is it (including extreme cases where the backtracking goes >> crazy), how bug-free is it, and so on. Plus, how much work would it be >> to actually incorporate it into CPython as a complete drop-in >> replacement of the re package (such that nobody needs to change their >> imports or the flags they pass to the re module). >> >> We'd also probably have to train some core developers to be familiar >> enough with the code to maintain and evolve it -- I assume we can't >> just volunteer Matthew to do so forever... :-) >> >> What's the alternative? Is adding the requested bug fixes and new >> features to _sre.c really that hard? > > Why not simply add the new lib, see whether it works out and > then decide which path to follow. > > We've done that with the old regex lib. It took a few years > and releases to have people port their applications to the > then new re module and syntax, but in the end it worked. 
> > With a new regex library there are likely going to be quite > a few subtle differences between re and regex - even if it's > just doing things in a more Unicode compatible way. > > I don't think anyone can actually list all the differences given > the complex nature of regular expressions, so people will > likely need a few years and releases to get used it before > a switch can be made. I can't say I liked how that transition was handled last time around. I really don't want to have to tell people "Oh, that bug is fixed but you have to use regex instead of re" and then a few years later have to tell them "Oh, we're deprecating regex, you should just use re". I'm really hoping someone has more actual technical understanding of re vs. regex and can give us some facts about the differences, rather than, frankly, FUD. -- --Guido van Rossum (python.org/~guido) From solipsis at pitrou.net Sat Aug 27 00:33:59 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 00:33:59 +0200 Subject: [Python-Dev] Should we move to replace re with regex? References: <4E581986.3000709@egenix.com> Message-ID: <20110827003359.43416085@pitrou.net> On Fri, 26 Aug 2011 15:18:35 -0700 Guido van Rossum wrote: > > I can't say I liked how that transition was handled last time around. > I really don't want to have to tell people "Oh, that bug is fixed but > you have to use regex instead of re" and then a few years later have > to tell them "Oh, we're deprecating regex, you should just use re". > > I'm really hoping someone has more actual technical understanding of > re vs. regex and can give us some facts about the differences, rather > than, frankly, FUD. The best way would be to contact the author, Matthew Barnett, or to ask on the tracker on http://bugs.python.org/issue2636. He has been quite willing to answer such questions in the past, AFAIR. Regards Antoine. From guido at python.org Sat Aug 27 00:47:21 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 15:47:21 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <20110827003359.43416085@pitrou.net> References: <4E581986.3000709@egenix.com> <20110827003359.43416085@pitrou.net> Message-ID: On Fri, Aug 26, 2011 at 3:33 PM, Antoine Pitrou wrote: > On Fri, 26 Aug 2011 15:18:35 -0700 > Guido van Rossum wrote: >> >> I can't say I liked how that transition was handled last time around. >> I really don't want to have to tell people "Oh, that bug is fixed but >> you have to use regex instead of re" and then a few years later have >> to tell them "Oh, we're deprecating regex, you should just use re". >> >> I'm really hoping someone has more actual technical understanding of >> re vs. regex and can give us some facts about the differences, rather >> than, frankly, FUD. > > The best way would be to contact the author, Matthew Barnett, I had added him to the beginning of this thread but someone took him off. > or to ask > on the tracker on http://bugs.python.org/issue2636. He has been quite > willing to answer such questions in the past, AFAIR. So, that issue is about something called "regexp". AFAIK Matthew (MRAB) wrote something called "regex" (http://pypi.python.org/pypi/regex). Are they two different things??? -- --Guido van Rossum (python.org/~guido) From drsalists at gmail.com Sat Aug 27 00:48:42 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Fri, 26 Aug 2011 15:48:42 -0700 Subject: [Python-Dev] Should we move to replace re with regex? 
In-Reply-To: References: Message-ID: On Fri, Aug 26, 2011 at 2:45 PM, Guido van Rossum wrote: > ...but on second thought I wonder if maybe regex is > mature enough to replace re in Python 3.3. > I agree that the move from regex to re was kind of painful. It seems someone should merge the unit tests for re and regex, and apply the merged result to each for the sake of comparison. There might also be a need to expand the merged result to include new things. Then there probably should be a from __future__ import for a while. -------------- next part -------------- An HTML attachment was scrubbed... URL: From martin at v.loewis.de Sat Aug 27 00:54:42 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 27 Aug 2011 00:54:42 +0200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: Message-ID: <4E582432.2080301@v.loewis.de> > However, I don't know much about regex The problem really is: nobody does (except for Matthew Barnett probably). This means that this contribution might be stuck "forever": somebody would have to review the module, identify issues, approve it, and take the blame if something breaks. That takes considerable time and has a considerable risk, for little expected glory - so nobody has volunteered to mentor/manage integration of that code. I believe most core contributors (who have run into this code) consider it worthwhile, but are just too scared to take action. Among us, some are more "regex gurus" than others; you know who you are. I guess the PSF would pay for the review, if that is what it would take. Regards, Martin From guido at python.org Sat Aug 27 00:57:26 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 15:57:26 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <4E582432.2080301@v.loewis.de> References: <4E582432.2080301@v.loewis.de> Message-ID: On Fri, Aug 26, 2011 at 3:54 PM, "Martin v. L?wis" wrote: >> However, I don't know much about regex > > The problem really is: nobody does (except for Matthew Barnett > probably). This means that this contribution might be stuck > "forever": somebody would have to review the module, identify > issues, approve it, and take the blame if something breaks. > That takes considerable time and has a considerable risk, for > little expected glory - so nobody has volunteered to > mentor/manage integration of that code. > > I believe most core contributors (who have run into this code) > consider it worthwhile, but are just too scared to take action. > > Among us, some are more "regex gurus" than others; you know > who you are. I guess the PSF would pay for the review, if that > is what it would take. Makes sense. I noticed Ezio seems quite in favor of regex. Maybe he knows more? -- --Guido van Rossum (python.org/~guido) From mal at egenix.com Sat Aug 27 01:00:31 2011 From: mal at egenix.com (M.-A. Lemburg) Date: Sat, 27 Aug 2011 01:00:31 +0200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <4E581986.3000709@egenix.com> Message-ID: <4E58258F.9050204@egenix.com> Guido van Rossum wrote: > On Fri, Aug 26, 2011 at 3:09 PM, M.-A. Lemburg wrote: >> Guido van Rossum wrote: >>> I just made a pass of all the Unicode-related bugs filed by Tom >>> Christiansen, and found that in several, the response was "this is >>> fixed in the regex module [by Matthew Barnett]". 
I started replying >>> that I thought that we should fix the bugs in the re module (i.e., >>> really in _sre.c) but on second thought I wonder if maybe regex is >>> mature enough to replace re in Python 3.3. It would mean that we won't >>> fix any of these bugs in earlier Python versions, but I could live >>> with that. >>> >>> However, I don't know much about regex -- how compatible is it, how >>> fast is it (including extreme cases where the backtracking goes >>> crazy), how bug-free is it, and so on. Plus, how much work would it be >>> to actually incorporate it into CPython as a complete drop-in >>> replacement of the re package (such that nobody needs to change their >>> imports or the flags they pass to the re module). >>> >>> We'd also probably have to train some core developers to be familiar >>> enough with the code to maintain and evolve it -- I assume we can't >>> just volunteer Matthew to do so forever... :-) >>> >>> What's the alternative? Is adding the requested bug fixes and new >>> features to _sre.c really that hard? >> >> Why not simply add the new lib, see whether it works out and >> then decide which path to follow. >> >> We've done that with the old regex lib. It took a few years >> and releases to have people port their applications to the >> then new re module and syntax, but in the end it worked. >> >> With a new regex library there are likely going to be quite >> a few subtle differences between re and regex - even if it's >> just doing things in a more Unicode compatible way. >> >> I don't think anyone can actually list all the differences given >> the complex nature of regular expressions, so people will >> likely need a few years and releases to get used it before >> a switch can be made. > > I can't say I liked how that transition was handled last time around. > I really don't want to have to tell people "Oh, that bug is fixed but > you have to use regex instead of re" and then a few years later have > to tell them "Oh, we're deprecating regex, you should just use re". No, you tell them: "If you want Unicode 6 semantics, use regex, if you're fine with Unicode 2.0/3.0 semantics, use re". After all, it's not like re suddenly stopped working :-) > I'm really hoping someone has more actual technical understanding of > re vs. regex and can give us some facts about the differences, rather > than, frankly, FUD. The good part is that it's based on the re code, the FUD comes from the fact that the new lib is 380kB larger than the old one and that's not even counting the generated 500kB of lookup tables. If no one steps up to do a review or analysis, I think the only practical way to test the lib is to give it a prominent chance to prove itself. The other aspect is maintenance. Perhaps we could have a summer of code student do a review and analysis to get familiar with the code and then have at least two developers know the code well enough to support it for a while. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 27 2011) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2011-10-04: PyCon DE 2011, Leipzig, Germany 38 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. 
Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From python at mrabarnett.plus.com Sat Aug 27 01:21:03 2011 From: python at mrabarnett.plus.com (MRAB) Date: Sat, 27 Aug 2011 00:21:03 +0100 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <8653.1314400096@chthon> References: <4E581986.3000709@egenix.com> <4E58258F.9050204@egenix.com> <8653.1314400096@chthon> Message-ID: <4E582A5F.7060804@mrabarnett.plus.com> On 27/08/2011 00:08, Tom Christiansen wrote: > "M.-A. Lemburg" wrote > on Sat, 27 Aug 2011 01:00:31 +0200: > >> The good part is that it's based on the re code, the FUD comes >> from the fact that the new lib is 380kB larger than the old one >> and that's not even counting the generated 500kB of lookup >> tables. > > Well, you have to put the property tables somewhere, somehow. > There are various schemes for demand loading them as needed, > but I don't know whether those are used. > FYI, the .pyd for Python v3.2 is 227KB, about half of which is property tables. From guido at python.org Sat Aug 27 01:29:02 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 16:29:02 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <4E582A5F.7060804@mrabarnett.plus.com> References: <4E581986.3000709@egenix.com> <4E58258F.9050204@egenix.com> <8653.1314400096@chthon> <4E582A5F.7060804@mrabarnett.plus.com> Message-ID: On Fri, Aug 26, 2011 at 4:21 PM, MRAB wrote: > On 27/08/2011 00:08, Tom Christiansen wrote: >> >> "M.-A. Lemburg" ?wrote >> ? ?on Sat, 27 Aug 2011 01:00:31 +0200: >> >>> The good part is that it's based on the re code, the FUD comes >>> from the fact that the new lib is 380kB larger than the old one >>> and that's not even counting the generated 500kB of lookup >>> tables. >> >> Well, you have to put the property tables somewhere, somehow. >> There are various schemes for demand loading them as needed, >> but I don't know whether those are used. >> > FYI, the .pyd for Python v3.2 is 227KB, about half of which is property > tables. I wouldn't hold the size of the generated tables against you. :-) -- --Guido van Rossum (python.org/~guido) From tchrist at perl.com Sat Aug 27 01:08:16 2011 From: tchrist at perl.com (Tom Christiansen) Date: Fri, 26 Aug 2011 17:08:16 -0600 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <4E58258F.9050204@egenix.com> References: <4E581986.3000709@egenix.com> <4E58258F.9050204@egenix.com> Message-ID: <8653.1314400096@chthon> "M.-A. Lemburg" wrote on Sat, 27 Aug 2011 01:00:31 +0200: > The good part is that it's based on the re code, the FUD comes > from the fact that the new lib is 380kB larger than the old one > and that's not even counting the generated 500kB of lookup > tables. Well, you have to put the property tables somewhere, somehow. There are various schemes for demand loading them as needed, but I don't know whether those are used. 
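One common shape for such a demand-loading scheme, shown here only as a toy sketch (this is not how re or regex actually store their tables): build a property table the first time a pattern asks for it and memoize it, so the cost is paid once per process and only for the properties that are actually used.

    import sys
    import unicodedata
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def category_table(category):
        # Built on first use, then cached for every later lookup.
        return frozenset(cp for cp in range(sys.maxunicode + 1)
                         if unicodedata.category(chr(cp)) == category)

    ord('A') in category_table('Lu')   # builds the 'Lu' table, returns True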
--tom From tjreedy at udel.edu Sat Aug 27 00:57:37 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 26 Aug 2011 18:57:37 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E576793.2010203@v.loewis.de> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> Message-ID: <4E5824E1.9010101@udel.edu> On 8/26/2011 5:29 AM, "Martin v. L?wis" wrote: >> IronPython and Jython can retain UTF-16 as their native form if that >> makes interop cleaner, but in doing so they need to ensure that basic >> operations like indexing and len work in terms of code points, not >> code units, if they are to conform. My impression is that a UFT-16 implementation, to be properly called such, must do len and [] in terms of code points, which is why Python's narrow builds are called UCS-2 and not UTF-16. > That means that they won't conform, period. There is no efficient > maintainable implementation strategy to achieve that property, Given that both 'efficient' and 'maintainable' are relative terms, that is you pessimistic opinion, not really a fact. > it may take well years until somebody provides an efficient > unmaintainable implementation. > >> Does this make sense, or have I completely misunderstood things? > > You seem to assume it is ok for Jython/IronPython to provide indexing in > O(n). It is not. Why do you keep saying that O(n) is the alternative? I have already given a simple solution that is O(logk), where k is the number of non-BMP characters/codepoints/surrogate_pairs if there are any, and O(1) otherwise (for all BMP chars). It uses O(k) space. I think that is pretty efficient. I suspect that is the most time efficient possible without using at least as much space as a UCS-4 solution. The fact that you and other do not want this for CPython should not preclude other implementations that are more tied to UTF-16 from exploring the idea. Maintainability partly depends on whether all-codepoint support is built in or bolted on to a BMP-only implementation burdened with back compatibility for a code unit API. Maintainability is probably harder with a separate UTF-32 type, which CPython has but which I gather Jython and Iron-Python do not. It might or might not be easier is there were a separate internal character type containing a 32 bit code point value, so that interation and indexing (and single char slicing) always returned the same type of object regardless of whether the character was in the BMP or not. This certainly would help all the unicode database functions. Tom Christiansen appears to have said that Perl is or will use UTF-8 plus auxiliary arrays. If so, we will find out if they can maintain it. --- Terry Jan Reedy From solipsis at pitrou.net Sat Aug 27 02:03:21 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 02:03:21 +0200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <4E581986.3000709@egenix.com> <20110827003359.43416085@pitrou.net> Message-ID: <20110827020321.64061f94@pitrou.net> On Fri, 26 Aug 2011 15:47:21 -0700 Guido van Rossum wrote: > > The best way would be to contact the author, Matthew Barnett, > > I had added him to the beginning of this thread but someone took him off. 
> > > or to ask > > on the tracker on http://bugs.python.org/issue2636. He has been quite > > willing to answer such questions in the past, AFAIR. > > So, that issue is about something called "regexp". AFAIK Matthew > (MRAB) wrote something called "regex" > (http://pypi.python.org/pypi/regex). Are they two different things??? No, it's the same. The source is at https://code.google.com/p/mrab-regex-hg/, btw. Regards Antoine. From solipsis at pitrou.net Sat Aug 27 02:06:35 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 02:06:35 +0200 Subject: [Python-Dev] Should we move to replace re with regex? References: <4E581986.3000709@egenix.com> <4E58258F.9050204@egenix.com> Message-ID: <20110827020635.272d75bd@pitrou.net> On Sat, 27 Aug 2011 01:00:31 +0200 "M.-A. Lemburg" wrote: > > > > I can't say I liked how that transition was handled last time around. > > I really don't want to have to tell people "Oh, that bug is fixed but > > you have to use regex instead of re" and then a few years later have > > to tell them "Oh, we're deprecating regex, you should just use re". > > No, you tell them: "If you want Unicode 6 semantics, use regex, > if you're fine with Unicode 2.0/3.0 semantics, use re". After all, > it's not like re suddenly stopped working :-) It has a whole lot of new features in addition to better unicode support. See for yourself: https://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails > Perhaps we could have a summer of code student do a review and > analysis to get familiar with the code and then have at least > two developers know the code well enough to support it for > a while. I'm not sure a GSoC student would be the best candidate to do a review matching our expectations. Regards Antoine. From solipsis at pitrou.net Sat Aug 27 02:08:35 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 02:08:35 +0200 Subject: [Python-Dev] Should we move to replace re with regex? References: Message-ID: <20110827020835.08a2a492@pitrou.net> On Fri, 26 Aug 2011 15:48:42 -0700 Dan Stromberg wrote: > > Then there probably should be a from __future__ import for a while. If you are willing to use a "from __future__ import", why not simply import regex as re ? We're not Perl, we don't have built-in syntactic support for regular expressions. Regards Antoine. From greg.ewing at canterbury.ac.nz Sat Aug 27 02:17:18 2011 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 27 Aug 2011 12:17:18 +1200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E58378E.2090609@canterbury.ac.nz> Paul Moore wrote: > IronPython and Jython can retain UTF-16 as their native form if that > makes interop cleaner, but in doing so they need to ensure that basic > operations like indexing and len work in terms of code points, not > code units, if they are to conform. ... They lose the O(1) > guarantee, but that's easily defensible as a tradeoff to conform to > underlying runtime semantics. I would only agree as long as it wasn't too much worse than O(1). O(log n) might be all right, but O(n) would be unacceptable, I think. 
-- Greg From drsalists at gmail.com Sat Aug 27 02:25:56 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Fri, 26 Aug 2011 17:25:56 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <20110827020835.08a2a492@pitrou.net> References: <20110827020835.08a2a492@pitrou.net> Message-ID: On Fri, Aug 26, 2011 at 5:08 PM, Antoine Pitrou wrote: > On Fri, 26 Aug 2011 15:48:42 -0700 > Dan Stromberg wrote: > > > > Then there probably should be a from __future__ import for a while. > > If you are willing to use a "from __future__ import", why not simply > > import regex as re > > ? We're not Perl, we don't have built-in syntactic support for regular > expressions. > > Regards > If you add regex as "import regex", and the new regex module doesn't work out, regex might be harder to get rid of. from __future__ import is an established way of trying something for a while to see if it's going to work. EG: "from __future__ import re", where re is really the new module. But whatever. -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Sat Aug 27 02:23:31 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 02:23:31 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E58378E.2090609@canterbury.ac.nz> Message-ID: <20110827022331.0d99a22c@pitrou.net> On Sat, 27 Aug 2011 12:17:18 +1200 Greg Ewing wrote: > Paul Moore wrote: > > > IronPython and Jython can retain UTF-16 as their native form if that > > makes interop cleaner, but in doing so they need to ensure that basic > > operations like indexing and len work in terms of code points, not > > code units, if they are to conform. ... They lose the O(1) > > guarantee, but that's easily defensible as a tradeoff to conform to > > underlying runtime semantics. > > I would only agree as long as it wasn't too much worse > than O(1). O(log n) might be all right, but O(n) would be > unacceptable, I think. It also depends a lot on *actual* measured performance. As someone mentioned in the tracker, the index you use on a string usually comes from a previous string operation (like a search), perhaps with a small offset. So a caching scheme may actually give very good results with a rather small overhead (you could cache, say, the 4 most recent indices and choose the nearest when an indexing operation is done; with utf-8, scanning backward and forward is equally simple). Regards Antoine. 
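For what it's worth, here is a toy sketch of that caching scheme (illustrative only, not something proposed for CPython), assuming a UTF-8 internal representation and keeping the four most recently used (code point index, byte offset) pairs:

    class Utf8String:
        def __init__(self, text):
            self.data = text.encode('utf-8')
            self.length = len(text)
            self.cache = [(0, 0)]          # (code point index, byte offset)

        def _byte_offset(self, index):
            # Start from the cached position nearest to the requested index
            # and scan from there; bytes of the form 0b10xxxxxx are
            # continuation bytes and never start a character.
            cp, byte = min(self.cache, key=lambda item: abs(item[0] - index))
            step = 1 if index >= cp else -1
            while cp != index:
                byte += step
                while (byte < len(self.data)
                       and (self.data[byte] & 0xC0) == 0x80):
                    byte += step
                cp += step
            self.cache = ([(cp, byte)] + self.cache)[:4]
            return byte

        def __getitem__(self, index):
            if not 0 <= index < self.length:
                raise IndexError(index)
            start = self._byte_offset(index)
            end = start + 1
            while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
                end += 1
            return self.data[start:end].decode('utf-8')

    s = Utf8String('abc\U0001F600def')
    s[3]    # scans forward from the start of the string, caches (3, 3)
    s[4]    # only one character away from the cached position

Sequential patterns such as iteration or search-then-slice stay close to O(1) per lookup with this scheme, while a pathological random-access loop degrades towards O(n) per lookup, which is exactly the trade-off being debated.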
From greg.ewing at canterbury.ac.nz Sat Aug 27 02:34:48 2011 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 27 Aug 2011 12:34:48 +1200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E575F31.5010709@egenix.com> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <87sjoqzs3a.fsf@uwakimon.sk.tsukuba.ac.jp> <87liuixrfh.fsf@uwakimon.sk.tsukuba.ac.jp> <4E55C2C3.3060205@canterbury.ac.nz> <4E575F31.5010709@egenix.com> Message-ID: <4E583BA8.5080406@canterbury.ac.nz> M.-A. Lemburg wrote: > Simply going with UCS-4 does not solve the problem, since > even with UCS-4 storage, you can still have surrogates in your > Python Unicode string. Yes, but in that case, you presumably *intend* them to be treated as separate indexing units. If you didn't, there would be no need to use surrogates in the first place. -- Greg From guido at python.org Sat Aug 27 02:42:32 2011 From: guido at python.org (Guido van Rossum) Date: Fri, 26 Aug 2011 17:42:32 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E5824E1.9010101@udel.edu> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> Message-ID: On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy wrote: > > > On 8/26/2011 5:29 AM, "Martin v. L?wis" wrote: >>> >>> IronPython and Jython can retain UTF-16 as their native form if that >>> makes interop cleaner, but in doing so they need to ensure that basic >>> operations like indexing and len work in terms of code points, not >>> code units, if they are to conform. > > My impression is that a UFT-16 implementation, to be properly called such, > must do len and [] in terms of code points, which is why Python's narrow > builds are called UCS-2 and not UTF-16. I don't think anyone else has that impression. Please cite chapter and verse if you really think this is important. IIUC, UCS-2 does not allow surrogate pairs, whereas Python (and Java, and .NET, and Windows) 16-bit strings all do support surrogate pairs. And they all have a len or length function that counts code units, not code points. >> That means that they won't conform, period. There is no efficient >> maintainable implementation strategy to achieve that property, > > Given that both 'efficient' and 'maintainable' are relative terms, that is > you pessimistic opinion, not really a fact. > >> it may take well years until somebody provides an efficient >> unmaintainable implementation. >> >>> Does this make sense, or have I completely misunderstood things? >> >> You seem to assume it is ok for Jython/IronPython to provide indexing in >> O(n). It is not. > > Why do you keep saying that O(n) is the alternative? I have already given a > simple solution that is O(logk), where k is the number of non-BMP > characters/codepoints/surrogate_pairs if there are any, and O(1) otherwise > (for all BMP chars). It uses O(k) space. I think that is pretty efficient. 
I > suspect that is the most time efficient possible without using at least as > much space as a UCS-4 solution. The fact that you and other do not want this > for CPython should not preclude other implementations that are more tied to > UTF-16 from exploring the idea. > > Maintainability partly depends on whether all-codepoint support is built in > or bolted on to a BMP-only implementation burdened with back compatibility > for a code unit API. Maintainability is probably harder with a separate > UTF-32 type, which CPython has but which I gather Jython and Iron-Python do > not. It might or might not be easier is there were a separate internal > character type containing a 32 bit code point value, so that interation and > indexing (and single char slicing) always returned the same type of object > regardless of whether the character was in the BMP or not. This certainly > would help all the unicode database functions. > > Tom Christiansen appears to have said that Perl is or will use UTF-8 plus > auxiliary arrays. If so, we will find out if they can maintain it. Their API style is completely different from ours. What Perl can maintain has little bearing on what Python can. -- --Guido van Rossum (python.org/~guido) From ben+python at benfinney.id.au Sat Aug 27 03:22:58 2011 From: ben+python at benfinney.id.au (Ben Finney) Date: Sat, 27 Aug 2011 11:22:58 +1000 Subject: [Python-Dev] Should we move to replace re with regex? References: <4E581986.3000709@egenix.com> <4E58258F.9050204@egenix.com> Message-ID: <87mxevvea5.fsf@benfinney.id.au> "M.-A. Lemburg" writes: > Guido van Rossum wrote: > > I really don't want to have to tell people "Oh, that bug is fixed > > but you have to use regex instead of re" and then a few years later > > have to tell them "Oh, we're deprecating regex, you should just use > > re". > > No, you tell them: "If you want Unicode 6 semantics, use regex, if > you're fine with Unicode 2.0/3.0 semantics, use re". What do we say, then, to those who are unaware of the different semantics between those versions of Unicode, and want regular expression to ?just work? in Python? To which document can we direct them to understand what semantics they want? > After all, it's not like re suddenly stopped working :-) For some value of ?working?, that is. The trick is to know whether that value is what one wants. -- \ ?The fact of your own existence is the most astonishing fact | `\ you'll ever have to confront. Don't dare ever see your life as | _o__) boring, monotonous, or joyless.? ?Richard Dawkins, 2010-03-10 | Ben Finney From ezio.melotti at gmail.com Sat Aug 27 03:37:21 2011 From: ezio.melotti at gmail.com (Ezio Melotti) Date: Sat, 27 Aug 2011 04:37:21 +0300 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <4E582432.2080301@v.loewis.de> Message-ID: On Sat, Aug 27, 2011 at 1:57 AM, Guido van Rossum wrote: > On Fri, Aug 26, 2011 at 3:54 PM, "Martin v. L?wis" > wrote: > > [...] > > Among us, some are more "regex gurus" than others; you know > > who you are. I guess the PSF would pay for the review, if that > > is what it would take. > > Makes sense. I noticed Ezio seems quite in favor of regex. Maybe he knows > more? > Matthew has always been responsive on the tracker, usually fixing reported bugs in a matter of days, and I think he's willing to keep doing so once the regex module is included. 
Even if I haven't yet tried the module myself (I'm planning to do it though), it seems quite popular out there (the download number on PyPI apparently gets reset for each new release, so I don't know the exact total), and apparently people are already using it as a replacement of re. I'm not sure it's worth doing an extensive review of the code, a better approach might be to require extensive test coverage (and a review of tests). If the code seems well written, commented, documented (I think proper rst documentation is still missing), and tested (both with unittest and out in the wild), and Matthew is willing to maintain it, I think we can include it. We will get familiar with the code once we start contributing to it and fixing bugs, as it already happens with most of the other modules. See also the "New regex module for 3.2?" thread ( http://mail.python.org/pipermail/python-dev/2010-July/101606.html ). Best Regards, Ezio Melotti > > -- > --Guido van Rossum (python.org/~guido ) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Sat Aug 27 03:54:26 2011 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 27 Aug 2011 11:54:26 +1000 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <87mxevvea5.fsf@benfinney.id.au> References: <4E581986.3000709@egenix.com> <4E58258F.9050204@egenix.com> <87mxevvea5.fsf@benfinney.id.au> Message-ID: <4E584E52.1080606@pearwood.info> Ben Finney wrote: > "M.-A. Lemburg" writes: >> No, you tell them: "If you want Unicode 6 semantics, use regex, if >> you're fine with Unicode 2.0/3.0 semantics, use re". > > What do we say, then, to those who are unaware of the different > semantics between those versions of Unicode, and want regular expression > to ?just work? in Python? > > To which document can we direct them to understand what semantics they > want? Presumably, like all modules, both the re and the regex module will have their own individual pages in the library reference. As the newcomer, regex should include a discussion of differences between the two. This can then be quietly dropped once re becomes formally deprecated. (Assuming that the std lib keeps re and regex in parallel for a few releases, which is not a given.) However, I note that last time, the old regex module was just documented as obsolete with little detailed discussion of the differences: http://docs.python.org/release/1.5/lib/node69.html#SECTION005300000000000000000 -- Steven From solipsis at pitrou.net Sat Aug 27 03:56:02 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 03:56:02 +0200 Subject: [Python-Dev] Should we move to replace re with regex? References: <4E582432.2080301@v.loewis.de> Message-ID: <20110827035602.557f772f@pitrou.net> On Sat, 27 Aug 2011 04:37:21 +0300 Ezio Melotti wrote: > > I'm not sure it's worth doing an extensive review of the code, a better > approach might be to require extensive test coverage (and a review of > tests). If the code seems well written, commented, documented (I think > proper rst documentation is still missing), Isn't this precisely what a review is supposed to assess? > We will get familiar with the code once we start contributing > to it and fixing bugs, as it already happens with most of the other modules. I'm not sure it's a good idea for a module with more than 10000 lines of C code (and 4000 lines of pure Python code). This is several times the size of multiprocessing. 
The C code looks very cleanly written, but it's still a big chunk of algorithmically sophisticated code. Another "interesting" question is whether it's easy to port to the PEP 393 string representation, if it gets accepted. Regards Antoine. From solipsis at pitrou.net Sat Aug 27 03:59:16 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 03:59:16 +0200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> Message-ID: <20110827035916.583c3d81@pitrou.net> On Fri, 26 Aug 2011 17:25:56 -0700 Dan Stromberg wrote: > On Fri, Aug 26, 2011 at 5:08 PM, Antoine Pitrou wrote: > > > On Fri, 26 Aug 2011 15:48:42 -0700 > > Dan Stromberg wrote: > > > > > > Then there probably should be a from __future__ import for a while. > > > > If you are willing to use a "from __future__ import", why not simply > > > > import regex as re > > > > ? We're not Perl, we don't have built-in syntactic support for regular > > expressions. > > > > Regards > > > > If you add regex as "import regex", and the new regex module doesn't work > out, regex might be harder to get rid of. from __future__ import is an > established way of trying something for a while to see if it's going to > work. That's an interesting idea. This way, integrating the new module would be a less risky move, since if it gives us too many problems, we could back out our decision in the next feature release. Regards Antoine. From ben+python at benfinney.id.au Sat Aug 27 05:15:18 2011 From: ben+python at benfinney.id.au (Ben Finney) Date: Sat, 27 Aug 2011 13:15:18 +1000 Subject: [Python-Dev] Should we move to replace re with regex? References: <4E581986.3000709@egenix.com> <4E58258F.9050204@egenix.com> <87mxevvea5.fsf@benfinney.id.au> <4E584E52.1080606@pearwood.info> Message-ID: <87fwknv92x.fsf@benfinney.id.au> Steven D'Aprano writes: > Ben Finney wrote: > > "M.-A. Lemburg" writes: > > >> No, you tell them: "If you want Unicode 6 semantics, use regex, if > >> you're fine with Unicode 2.0/3.0 semantics, use re". > > > > What do we say, then, to those who are unaware of the different > > semantics between those versions of Unicode, and want regular expression > > to ?just work? in Python? > > > > To which document can we direct them to understand what semantics they > > want? > > Presumably, like all modules, both the re and the regex module will > have their own individual pages in the library reference. My question is directed more to M-A Lemburg's passage above, and its implicit assumption that the user understand the changes between ?Unicode 2.0/3.0 semantics? and ?Unicode 6 semantics?, and how their own needs relate to those semantics. For programmers who know they want to follow Unicode conventions in Python, but don't know the distinction M-A Lemburg is drawing, to which document does he recommend we direct them? ?The Unicode specification document in its various versions? isn't a feasible answer. -- \ ?Computers are useless. They can only give you answers.? ?Pablo | `\ Picasso | _o__) | Ben Finney From steve at pearwood.info Sat Aug 27 05:31:03 2011 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 27 Aug 2011 13:31:03 +1000 Subject: [Python-Dev] Should we move to replace re with regex? 
In-Reply-To: <87fwknv92x.fsf@benfinney.id.au> References: <4E581986.3000709@egenix.com> <4E58258F.9050204@egenix.com> <87mxevvea5.fsf@benfinney.id.au> <4E584E52.1080606@pearwood.info> <87fwknv92x.fsf@benfinney.id.au> Message-ID: <4E5864F7.2010106@pearwood.info> Ben Finney wrote: > Steven D'Aprano writes: > >> Ben Finney wrote: >>> "M.-A. Lemburg" writes: >>>> No, you tell them: "If you want Unicode 6 semantics, use regex, if >>>> you're fine with Unicode 2.0/3.0 semantics, use re". >>> What do we say, then, to those who are unaware of the different >>> semantics between those versions of Unicode, and want regular expression >>> to ?just work? in Python? >>> >>> To which document can we direct them to understand what semantics they >>> want? >> Presumably, like all modules, both the re and the regex module will >> have their own individual pages in the library reference. > > My question is directed more to M-A Lemburg's passage above, and its > implicit assumption that the user understand the changes between > ?Unicode 2.0/3.0 semantics? and ?Unicode 6 semantics?, and how their own > needs relate to those semantics. > > For programmers who know they want to follow Unicode conventions in > Python, but don't know the distinction M-A Lemburg is drawing, to which > document does he recommend we direct them? I can only repeat my answer: the docs for the new regex module should include a discussion of the differences. If that requires summarising the differences that M-A Lemburg refers to, then so be it. > ?The Unicode specification document in its various versions? isn't a > feasible answer. Presumably the Unicode spec will be the canonical source, but I agree that we should not expect people to read that in order to make a decision between re and regex. -- Steven From steve at pearwood.info Sat Aug 27 05:47:34 2011 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 27 Aug 2011 13:47:34 +1000 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <20110827035916.583c3d81@pitrou.net> References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> Message-ID: <4E5868D6.8090203@pearwood.info> Antoine Pitrou wrote: > On Fri, 26 Aug 2011 17:25:56 -0700 > Dan Stromberg wrote: [...] >> If you add regex as "import regex", and the new regex module doesn't work >> out, regex might be harder to get rid of. from __future__ import is an >> established way of trying something for a while to see if it's going to >> work. > > That's an interesting idea. This way, integrating the new module would > be a less risky move, since if it gives us too many problems, we could > back out our decision in the next feature release. I'm not sure that's correct. If there are differences in either the interface or the behaviour between the new regex and re, then reverting will be a pain regardless of whether you have: from __future__ import re re.compile(...) or import regex regex.compile(...) Either way, if the new regex library goes away, code will break, and fixing it may not be easy. It's not likely to be so easy that merely deleting the "from __future__ ..." line will do it, but if it is that easy, then using "import re as regex" will be just as easy. Have then been any __future__ features that were added provisionally? I can't think of any. That's not what __future__ is for, at least according to PEP 236. http://www.python.org/dev/peps/pep-0236/ I can't think of any __future__ feature that could be easily reverted once people start relying on it. 
Either syntax would break, or behaviour would change. The PEP even explicitly states that __future__ should not be used for changes which are backward compatible: Note that there is no need to involve the future_statement machinery in new features unless they can break existing code; fully backward- compatible additions can-- and should --be introduced without a corresponding future_statement. I wasn't around for the move from 1.4 regex to 1.5 re, so I don't know what was done poorly last time. But I can't see why we should treat regular expressions so differently from (say) argparse and optparse. from __future__ import optparse No. Just... no. -- Steven From tjreedy at udel.edu Sat Aug 27 05:51:30 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 26 Aug 2011 23:51:30 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> Message-ID: <4E5869C2.2040008@udel.edu> On 8/26/2011 8:42 PM, Guido van Rossum wrote: > On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy wrote: >> My impression is that a UFT-16 implementation, to be properly called such, >> must do len and [] in terms of code points, which is why Python's narrow >> builds are called UCS-2 and not UTF-16. > > I don't think anyone else has that impression. Please cite chapter and > verse if you really think this is important. IIUC, UCS-2 does not > allow surrogate pairs, whereas Python (and Java, and .NET, and > Windows) 16-bit strings all do support surrogate pairs. And they all For that reason, I think UTF-16 is a better term that UCS-2 for narrow builds (whether or not the above impression is true). But Marc Lemburg disagrees. http://mail.python.org/pipermail/python-dev/2010-November/105751.html The 2.7 docs still refer to usc2 builds, as is his wish. --- Terry Jan Reedy From g.brandl at gmx.net Sat Aug 27 07:47:35 2011 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 27 Aug 2011 07:47:35 +0200 Subject: [Python-Dev] Sphinx version for Python 2.x docs In-Reply-To: References: <4E4AF610.5040303@simplistix.co.uk> Message-ID: Am 23.08.2011 01:09, schrieb Sandro Tosi: > Hi all, > >> Any chance the version of sphinx used to generate the docs on >> docs.python.org could be updated? > > I'd like to discuss this aspect, in particular for the implication it > has on http://bugs.python.org/issue12409 . > > Personally, I do think it has a value to have the same set of tools to > build the Python documentation of the currently active branches. > Currently, only 2.7 is different, since it still fetches (from > svn.python.org... can we fix this too? suggestions welcome!) sphinx > 0.6.7 while 3.2/3.3 uses 1.0.7. > > If you're worried about the time needed to convert the actual 2.7 doc > to new sphinx format and all the related changes, I volunteer to do > the job (and/or collaborate with whom is already on it), but what I > want to understand if it's an acceptable change. > > I see sphinx more as of an internal, building tool, so freezing it > it's like saying "don't upgrade gcc" or so. Now the delta is just the > C functions definitions and some py-specific roles, but during the > years it will increase. 
Keeping it small, simplifying the forward-port > of doc patches (not needing to have 2 version between 2.7 and 3.x > f.e.) and having a common set of tools for doc building is worth IMHO. > > What do you think about it? and yes Georg, I'd like to hear your opinion too :) One of the main reasons for keeping Sphinx compatibility to 0.6.x was to enable distributions (like Debian) to build the docs for the Python they ship with the version of Sphinx that they ship. This should now be fine with 1.0.x, so since you are ready to do the work of converting the 2.7 Doc sources, it will be accepted. The argument of easier backports is a very good one. The issue of using svn to download the tools is orthogonal; for this I would agree to just packaging up a tarball or zipfile that is then downloaded using a small Python script (should be properly cross-platform then). Cloning the original repositories is a) not useful, b) depends on availability of at least two additional servers (remember docutils) and c) requires hg and svn. Georg From raymond.hettinger at gmail.com Sat Aug 27 07:58:10 2011 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Fri, 26 Aug 2011 22:58:10 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E5869C2.2040008@udel.edu> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> Message-ID: <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> On Aug 26, 2011, at 8:51 PM, Terry Reedy wrote: > > > On 8/26/2011 8:42 PM, Guido van Rossum wrote: >> On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy wrote: > >>> My impression is that a UFT-16 implementation, to be properly called such, >>> must do len and [] in terms of code points, which is why Python's narrow >>> builds are called UCS-2 and not UTF-16. >> >> I don't think anyone else has that impression. Please cite chapter and >> verse if you really think this is important. IIUC, UCS-2 does not >> allow surrogate pairs, whereas Python (and Java, and .NET, and >> Windows) 16-bit strings all do support surrogate pairs. And they all > > For that reason, I think UTF-16 is a better term that UCS-2 for narrow builds (whether or not the above impression is true). I agree. It's weird to call something UCS-2 if code points above 65535 are representable. The naming convention for codecs is that the UTF prefix is used for lossless encodings that cover the entire range of Unicode. "The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP." Raymond -------------- next part -------------- An HTML attachment was scrubbed... URL: From drsalists at gmail.com Sat Aug 27 08:01:21 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Fri, 26 Aug 2011 23:01:21 -0700 Subject: [Python-Dev] Should we move to replace re with regex? 
In-Reply-To: <4E5868D6.8090203@pearwood.info> References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: On Fri, Aug 26, 2011 at 8:47 PM, Steven D'Aprano wrote: > Antoine Pitrou wrote: > >> On Fri, 26 Aug 2011 17:25:56 -0700 >> Dan Stromberg wrote: >> > If you add regex as "import regex", and the new regex module doesn't work > >> out, regex might be harder to get rid of. from __future__ import is an >>> established way of trying something for a while to see if it's going to >>> work. >>> >> >> That's an interesting idea. This way, integrating the new module would >> be a less risky move, since if it gives us too many problems, we could >> back out our decision in the next feature release. >> > > I'm not sure that's correct. If there are differences in either the > interface or the behaviour between the new regex and re, then reverting will > be a pain regardless of whether you have: > > from __future__ import re > re.compile(...) > > or > > import regex > regex.compile(...) > > > Either way, if the new regex library goes away, code will break, and fixing > it may not be easy. You're talking technically, which is important, but wasn't what I was suggesting would be helped. Politically, and from a marketing standpoint, it's easier to withdraw a feature you've given with a "Play with this, see if it works for you" warning. Have then been any __future__ features that were added provisionally? > I can't either, but ISTR hearing that from __future__ import was started with such an intent. Irrespective, it's hard to import something from "future" without at least suspecting that you're on the bleeding edge. -------------- next part -------------- An HTML attachment was scrubbed... URL: From martin at v.loewis.de Sat Aug 27 08:02:31 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Sat, 27 Aug 2011 08:02:31 +0200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <4E582432.2080301@v.loewis.de> Message-ID: <4E588877.3080204@v.loewis.de> > I'm not sure it's worth doing an extensive review of the code, a better > approach might be to require extensive test coverage (and a review of > tests). I think it's worth. It's really bad if only one developer fully understands the regex implementation. Regards, Martin From tjreedy at udel.edu Sat Aug 27 08:25:17 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 27 Aug 2011 02:25:17 -0400 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <20110827022331.0d99a22c@pitrou.net> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E58378E.2090609@canterbury.ac.nz> <20110827022331.0d99a22c@pitrou.net> Message-ID: On 8/26/2011 8:23 PM, Antoine Pitrou wrote: >> I would only agree as long as it wasn't too much worse >> than O(1). O(log n) might be all right, but O(n) would be >> unacceptable, I think. > > It also depends a lot on *actual* measured performance Amen. Some regard O(n*n) sorts to be, by definition, 'worse' than O(n*logn). I even read that in an otherwise good book by a university professor. 
Fortunately for Python users, Tim Peters ignored that 'wisdom', coded the best O(n*n) sort he could, and then *measured* to find out what was better for what types and lengths of arrays. So now we have a list.sort that sometimes beats the pure O(n log n) quicksort of C libraries. -- Terry Jan Reedy From martin at v.loewis.de Sat Aug 27 08:31:44 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 27 Aug 2011 08:31:44 +0200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: <4E588F50.1040903@v.loewis.de> > I can't either, but ISTR hearing that from __future__ import was started > with such an intent. No, not at all. The original intention was to enable features that would definitely be added, not just right now. Tim Peters always objected to claims that future imports were talking about provisional features. > Politically, and from a marketing standpoint, it's easier to withdraw > a feature you've given with a "Play with this, see if it works for > you" warning. We don't want to add features to Python that we may have to withdraw. If there is doubt whether they should be added, they shouldn't be added. If they do get added, we have to live with it (until, say, Python 4, where bad features can be removed again). Regards, Martin From tjreedy at udel.edu Sat Aug 27 08:33:44 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 27 Aug 2011 02:33:44 -0400 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <20110827035602.557f772f@pitrou.net> References: <4E582432.2080301@v.loewis.de> <20110827035602.557f772f@pitrou.net> Message-ID: On 8/26/2011 9:56 PM, Antoine Pitrou wrote: > Another "interesting" question is whether it's easy to port to the PEP > 393 string representation, if it gets accepted. Will the re module need porting also? -- Terry Jan Reedy From martin at v.loewis.de Sat Aug 27 09:18:14 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 27 Aug 2011 09:18:14 +0200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <4E582432.2080301@v.loewis.de> <20110827035602.557f772f@pitrou.net> Message-ID: <4E589A36.80109@v.loewis.de> Am 27.08.2011 08:33, schrieb Terry Reedy: > On 8/26/2011 9:56 PM, Antoine Pitrou wrote: > >> Another "interesting" question is whether it's easy to port to the PEP >> 393 string representation, if it gets accepted. > > Will the re module need porting also? That's a quality-of-implementation issue (in both cases). In principle, the modules should continue to work unmodified, and indeed SRE does. However, the module will then match on Py_UNICODE, which may be expensive to produce, and may not meet your expectations of surrogate pair handling. So realistically, the module should be ported, which has the challenge that matching needs to operate on three different representations. The modules already support two representations (unsigned char and Py_UNICODE), but probably switching on type, not on state.
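(An aside, not part of Martin's message.) A toy Python sketch of the "switch on the string's state rather than compile per character type" idea: a single accessor picks the storage width at run time, so one matching loop can serve 1-, 2- and 4-byte representations:

    import struct

    _FMT = {1: "<B", 2: "<H", 4: "<I"}   # bytes per character -> unpack format

    def char_at(buf, kind, i):
        # kind is the per-string "state": 1, 2 or 4 bytes per character
        return struct.unpack_from(_FMT[kind], buf, i * kind)[0]

    data = struct.pack("<3I", ord("a"), ord("b"), ord("c"))   # 4 bytes/char
    assert char_at(data, 4, 1) == ord("b")
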
Regards, Martin From steve at pearwood.info Sat Aug 27 09:40:24 2011 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 27 Aug 2011 17:40:24 +1000 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E58378E.2090609@canterbury.ac.nz> <20110827022331.0d99a22c@pitrou.net> Message-ID: <4E589F68.60301@pearwood.info> Terry Reedy wrote: > On 8/26/2011 8:23 PM, Antoine Pitrou wrote: > >>> I would only agree as long as it wasn't too much worse >>> than O(1). O(log n) might be all right, but O(n) would be >>> unacceptable, I think. >> >> It also depends a lot on *actual* measured performance > > Amen. Some regard O(n*n) sorts to be, by definition, 'worse' than > O(n*logn). I even read that in an otherwise good book by a university > professor. Fortunately for Python users, Tim Peters ignored that > 'wisdom', coded the best O(n*n) sort he could, and then *measured* to > find out what was better for what types and lengths of arrays. So not we > have a list.sort that sometimes beats the pure O(nlog) quicksort of C > libraries. A nice story, but Quicksort's worst case is O(n*n) too. http://en.wikipedia.org/wiki/Quicksort timsort is O(n) in the best case (all items already in order). You are right though about Tim Peters doing extensive measurements: http://bugs.python.org/file4451/timsort.txt If you haven't read the whole thing, do so. I am in awe -- not just because he came up with the algorithm, but because of the discipline Tim demonstrated in such detailed testing. A far cry from a couple of timeit runs on short-ish lists. -- Steven From martin at v.loewis.de Sat Aug 27 09:59:03 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 27 Aug 2011 09:59:03 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E589F68.60301@pearwood.info> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E58378E.2090609@canterbury.ac.nz> <20110827022331.0d99a22c@pitrou.net> <4E589F68.60301@pearwood.info> Message-ID: <4E58A3C7.6050301@v.loewis.de> Am 27.08.2011 09:40, schrieb Steven D'Aprano: > Terry Reedy wrote: >> On 8/26/2011 8:23 PM, Antoine Pitrou wrote: >> >>>> I would only agree as long as it wasn't too much worse >>>> than O(1). O(log n) might be all right, but O(n) would be >>>> unacceptable, I think. >>> >>> It also depends a lot on *actual* measured performance >> >> Amen. Some regard O(n*n) sorts to be, by definition, 'worse' than >> O(n*logn). I even read that in an otherwise good book by a university >> professor. Fortunately for Python users, Tim Peters ignored that >> 'wisdom', coded the best O(n*n) sort he could, and then *measured* to >> find out what was better for what types and lengths of arrays. So not >> we have a list.sort that sometimes beats the pure O(nlog) quicksort of >> C libraries. > > A nice story, but Quicksort's worst case is O(n*n) too. 
In addition, timsort is O(n log n), which also makes it a real good O(n*n) sort :-) Regards, Martin From ncoghlan at gmail.com Sat Aug 27 10:02:49 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 27 Aug 2011 18:02:49 +1000 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: On Sat, Aug 27, 2011 at 4:01 PM, Dan Stromberg wrote: > You're talking technically, which is important, but wasn't what I was > suggesting would be helped. > > Politically, and from a marketing standpoint, it's easier to withdraw a > feature you've given with a "Play with this, see if it works for you" > warning. The standard library isn't for playing. "pip install regex" is for playing. If we aren't sure we want to make the transition, then it doesn't go in. However, to my mind, reviewing and incorporating regex is a far more feasible model than trying to enhance the existing re module with a comparable feature set. At the moment, there's already an obvious way to get enhanced regex support in Python: install regex and use it instead of the standard library's re module. That's enough to pretty much kill any motivation anyone might have to make major changes to re itself. We're at least getting one thing right this time that we got wrong with multiprocessing, though - we're much, much further out from the 3.3 release than we were from the 2.6 release when multiprocessing was added to the standard library :) The next step needed is for someone to volunteer to write and champion a PEP that: - articulates the deficiencies in the current re module (the regex docs already cover some of this, as do Tom Christiansen's notes on the issue tracker) - explains why upgrading re in place is not feasible (e.g. noting that the availability of regex really limits the desire for anyone to reinvent that particular wheel, so even things that are theoretically possible may be highly unlikely in practice) - proposes a transition plan (personally, I'd be fine with an optparse -> argparse style transition where re remains around indefinitely to support legacy code, but new users are pointed towards regex. But depending on compatibility details, merging the two APIs in the existing re namespace may also be feasible) - proposes a maintenance strategy (I don't know how much Matthew has written regarding internal design details, but that kind of thing could really help. Matthew agreeing to continue maintenance as part of the standard library would also help a great deal, but wouldn't be enough on its own - while it's good for modules to have active maintainers to make the final call associated design decisions, it's potentially problematic when other core developers don't understand what the code is doing well enough to fix bugs in it) - confirms that the regex test suite can be incorporated cleanly into the standard library regression test suite (the difficulty of this was something that was underestimated for the inclusion of multiprocessing. Test suite integration is also the final sticking point holding up the PEP 380 'yield from' patch, although that's close to being resolved following the PyConAU sprints) - document tests conducted (e.g. 
micro-benchmark results, fusil results) PEP 371 (addition of multiprocessing), PEP 389 (addition of argparse) and Jesse's reflections on the way multiprocessing was added (http://jessenoller.com/2009/01/28/multiprocessing-in-hindsight/) are well worth reading for anyone considering stepping up to write a PEP. That last also highlights why even Matthew's support, however capably he has handled maintenance of regex as an independent project, wouldn't be enough - we had Richard Oudkerk's support and agreement to continue maintenance as the original author of multiprocessing, but he became unavailable early in the integration process. If Jesse hadn't been able to take up most of that slack, the likely result would have been reversion of the changes and removal of multiprocessing from the 2.6 release. Writing PEPs can be quite a frustrating experience (since a lot of feedback will be negative as people try to poke holes in the idea to see if it stands up to close scrutiny), but it's also really satisfying and rewarding if they end up getting accepted and incorporated :) >> Have then been any __future__ features that were added provisionally? > > I can't either, but ISTR hearing that from __future__ import was started > with such an intent. Irrespective, it's hard to import something from > "future" without at least suspecting that you're on the bleeding edge. No, we make an explicit guarantee that future imports will never go away once they've been added. They may become redundant, but they won't break. There's no provision in the future mechanism for changes that are added and then later removed (see http://docs.python.org/dev/library/__future__). They're strictly for cases where backwards incompatibilities (usually, but not always, new keywords) may break existing code. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ask at celeryproject.org Sat Aug 27 11:59:16 2011 From: ask at celeryproject.org (Ask Solem) Date: Sat, 27 Aug 2011 10:59:16 +0100 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: <20110826175336.3af6be57@pitrou.net> References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> Message-ID: On 26 Aug 2011, at 16:53, Antoine Pitrou wrote: > > Hi, > >> I think that "deprecating" the use of threads w/ multiprocessing - or >> at least crippling it is the wrong answer. Multiprocessing needs the >> helper threads it uses internally to manage queues, etc. Removing that >> ability would require a near-total rewrite, which is just a >> non-starter. > > I agree that this wouldn't actually benefit anyone. > Besides, I don't think it's even possible to avoid threads in > multiprocessing, given the various constraints. We would have to force > the user to run their main thread in an event loop, and that would be > twisted (tm). > >> I would focus on the atfork() patch more directly, ignoring >> multiprocessing in the discussion, and focusing on the merits of gps' >> initial proposal and patch. > > I think this could also be combined with Charles-François' patch. > > Regards Have to agree with Jesse and Antoine here. Celery (celeryproject.org) uses multiprocessing, is widely used in production, and is regarded as stable software that has been known to run for months at a time only to be restarted for software upgrades. I have been investigating an issue for some time, that I'm pretty sure is caused by this.
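For readers unfamiliar with the failure mode being alluded to here, a minimal sketch (not from the original mail; POSIX-only and deliberately simplified) of how a lock held by another thread at fork() time can hang the child:

    import os, threading, time

    lock = threading.Lock()

    def worker():
        while True:
            with lock:           # repeatedly holds the lock for a short while
                time.sleep(0.01)

    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
    time.sleep(0.1)

    pid = os.fork()              # POSIX only
    if pid == 0:
        # The child gets a copy of the lock in whatever state it had at
        # fork time; the thread that would release it does not exist here.
        # If the worker happened to hold it, this acquire() never returns.
        lock.acquire()
        os._exit(0)
    else:
        os.waitpid(pid, 0)       # may hang if the child deadlocked

This is also why the failure is intermittent: it only bites when the fork lands inside the worker's critical section.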
It occurs only rarely, so rarely that I have not had any actual bug reports about it, it's just something I have experienced during extensive testing. The tone of the discussion on the bug tracker makes me think that I have been very lucky :-) Using the fork+exec approach seems like a much more realistic solution than rewriting multiprocessing.Pool and Manager to not use threads. In fact this is something I have been considering as a fix for the suspected issue for some time. It does have implications that are annoying for sure, but we are already used to this on the Windows platform (it could help portability even). -- Ask Solem twitter.com/asksol | +44 (0)7713357179 From solipsis at pitrou.net Sat Aug 27 12:09:29 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 12:09:29 +0200 Subject: [Python-Dev] Should we move to replace re with regex? References: <4E582432.2080301@v.loewis.de> <20110827035602.557f772f@pitrou.net> <4E589A36.80109@v.loewis.de> Message-ID: <20110827120929.2600c3e9@pitrou.net> On Sat, 27 Aug 2011 09:18:14 +0200 "Martin v. Löwis" wrote: > Am 27.08.2011 08:33, schrieb Terry Reedy: > > On 8/26/2011 9:56 PM, Antoine Pitrou wrote: > > > >> Another "interesting" question is whether it's easy to port to the PEP > >> 393 string representation, if it gets accepted. > > > > Will the re module need porting also? > > That's a quality-of-implementation issue (in both cases). In principle, > the modules should continue to work unmodified, and indeed SRE does. > However, the module will then match on Py_UNICODE, which may be > expensive to produce, and may not meet your expectations of surrogate > pair handling. > > So realistically, the module should be ported, which has the challenge > that matching needs to operate on three different representations. The > modules already support two representations (unsigned char and > Py_UNICODE), but probably switching on type, not on state. From what I've seen, re generates two different sets of functions at compile-time (with a stringlib-like approach), while regex has a run-time flag to choose between the two representations (where, interestingly, the two code paths are explicitly spelled out, almost duplicates of each other). Matthew, please correct me if I'm wrong. Regards Antoine. From solipsis at pitrou.net Sat Aug 27 12:10:12 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 12:10:12 +0200 Subject: [Python-Dev] Should we move to replace re with regex? References: <4E582432.2080301@v.loewis.de> <4E588877.3080204@v.loewis.de> Message-ID: <20110827121012.37b39947@pitrou.net> On Sat, 27 Aug 2011 08:02:31 +0200 "Martin v. Löwis" wrote: > > I'm not sure it's worth doing an extensive review of the code, a better > > approach might be to require extensive test coverage (and a review of > > tests). > > I think it's worth. It's really bad if only one developer fully > understands the regex implementation. Could such a review be the topic of an informational PEP? Regards Antoine. From arigo at tunes.org Sat Aug 27 12:45:05 2011 From: arigo at tunes.org (Armin Rigo) Date: Sat, 27 Aug 2011 12:45:05 +0200 Subject: [Python-Dev] Software Transactional Memory for Python Message-ID: Hi all, About multithreading models: I recently made an observation which might be obvious to some, but not to me, and as far as I know not to most of us either. I think that it's worth being pointed out :-) http://mail.python.org/pipermail/pypy-dev/2011-August/008153.html A bientôt, Armin.
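The observation in that link is, roughly, that the GIL already gives CPython a cheap way to offer an "atomic section" to pure Python code. As a rough, runnable stand-in for what such a construct means (this emulation uses an ordinary lock; the actual patch simply keeps the GIL held for the duration of the block, which also covers threads that never touch any lock):

    import threading

    class _AtomicSection(object):
        # Illustrative emulation only: the real "with atomic" needs no lock
        # and excludes *all* other Python threads, not just cooperating ones.
        _lock = threading.RLock()
        def __enter__(self):
            self._lock.acquire()
        def __exit__(self, *exc_info):
            self._lock.release()

    atomic = _AtomicSection()

    account_a = {"balance": 100}
    account_b = {"balance": 0}

    def transfer(amount):
        with atomic:
            account_a["balance"] -= amount
            account_b["balance"] += amount   # both updates become visible together
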
From exarkun at twistedmatrix.com Sat Aug 27 13:11:29 2011 From: exarkun at twistedmatrix.com (exarkun at twistedmatrix.com) Date: Sat, 27 Aug 2011 11:11:29 -0000 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: Message-ID: <20110827111129.1808.139401277.divmod.xquotient.9@localhost.localdomain> On 26 Aug, 09:45 pm, guido at python.org wrote: >I just made a pass of all the Unicode-related bugs filed by Tom >Christiansen, and found that in several, the response was "this is >fixed in the regex module [by Matthew Barnett]". I started replying >that I thought that we should fix the bugs in the re module (i.e., >really in _sre.c) but on second thought I wonder if maybe regex is >mature enough to replace re in Python 3.3. It would mean that we won't >fix any of these bugs in earlier Python versions, but I could live >with that. > >However, I don't know much about regex -- how compatible is it, how >fast is it (including extreme cases where the backtracking goes >crazy), how bug-free is it, and so on. Plus, how much work would it be >to actually incorporate it into CPython as a complete drop-in >replacement of the re package (such that nobody needs to change their >imports or the flags they pass to the re module). > >We'd also probably have to train some core developers to be familiar >enough with the code to maintain and evolve it -- I assume we can't >just volunteer Matthew to do so forever... :-) > >What's the alternative? Is adding the requested bug fixes and new >features to _sre.c really that hard? What about other Python implementations (ie, PEP 399)? For this to be seriously considered, shouldn't there also be a pure Python implementation of the functionality? Jean-Paul From ncoghlan at gmail.com Sat Aug 27 14:40:35 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 27 Aug 2011 22:40:35 +1000 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: References: Message-ID: On Sat, Aug 27, 2011 at 8:45 PM, Armin Rigo wrote: > Hi all, > > About multithreading models: I recently made an observation which > might be obvious to some, but not to me, and as far as I know not to > most of us either. ?I think that it's worth being pointed out :-) > > http://mail.python.org/pipermail/pypy-dev/2011-August/008153.html Having a context manager to say "don't release the GIL" for a bit could actually be really nice (e.g. for implementing builtin-style method semantics for data types written in Python). However, two immediate questions come to mind: 1. How does the patch interact with C code that explicitly releases the GIL? (e.g. IO commands inside a "with atomic:" block) 2. Whether or not Jython and IronPython could implement something like that, since they're free threaded with fine-grained locks. If they can't then I don't see how we could justify making it part of the standard library. Interesting idea, though :) Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From arigo at tunes.org Sat Aug 27 15:08:36 2011 From: arigo at tunes.org (Armin Rigo) Date: Sat, 27 Aug 2011 15:08:36 +0200 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: References: Message-ID: Hi Nick, On Sat, Aug 27, 2011 at 2:40 PM, Nick Coghlan wrote: > 1. How does the patch interact with C code that explicitly releases > the GIL? (e.g. 
IO commands inside a "with atomic:" block) As implemented, any code in a "with atomic" is prevented from explicitly releasing and reacquiring the GIL: the GIL remain acquired until the end of the "with" block. In other words Py_BEGIN_ALLOW_THREADS has no effect in a "with" block. This gives semantics that, in a full multi-core STM world, would be implementable by saying that if, in the middle of a transaction, you need to do I/O, then from this point onwards the transaction is not allowed to abort any more. Such "inevitable" transactions are already supported e.g. by RSTM, the C++ framework I used to prototype a C version (https://bitbucket.org/arigo/arigo/raw/default/hack/stm/c ). > 2. Whether or not Jython and IronPython could implement something like > that, since they're free threaded with fine-grained locks. If they > can't then I don't see how we could justify making it part of the > standard library. Yes, I can imagine some solutions. I am no Jython or IronPython expert, but let us assume that they have a way to check synchronously for external events from time to time (i.e. if there is some equivalent to sys.setcheckinterval()). If they do, then all you need is the right synchronization: the thread that wants to start a "with atomic" has to wait until all other threads are paused in the external check code. (Again, like CPython's, this not a properly multi-core STM-ish solution, but it would give the right semantics. (And if it turns out that STM is successful in the future, Java will grow more direct support for it )) A bient?t, Armin. From nadeem.vawda at gmail.com Sat Aug 27 15:47:45 2011 From: nadeem.vawda at gmail.com (Nadeem Vawda) Date: Sat, 27 Aug 2011 15:47:45 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 Message-ID: Hello all, I'd like to propose the addition of a new module in Python 3.3. The 'lzma' module will provide support for compression and decompression using the LZMA algorithm, and the .xz and .lzma file formats. The matter has already been discussed on the tracker , where there seems to be a consensus that this is a desirable feature. What are your thoughts? The proposed module's API will be very similar to that of the bz2 module; the only differences will be additional keyword arguments to some functions, for specifying container formats and detailed compressor options. The implementation will also be similar to bz2 - basic compressor and decompressor classes written in C, with convenience functions and a file interface implemented on top of those in Python. I've already done some work on the C parts of the module; I'll push that to my sandbox in the next day or two. Cheers, Nadeem From rosslagerwall at gmail.com Sat Aug 27 16:36:50 2011 From: rosslagerwall at gmail.com (Ross Lagerwall) Date: Sat, 27 Aug 2011 16:36:50 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: Message-ID: <1314455810.11891.7.camel@hobo> > I'd like to propose the addition of a new module in Python 3.3. The 'lzma' > module will provide support for compression and decompression using the LZMA > algorithm, and the .xz and .lzma file formats. The matter has already been > discussed on the tracker , where there seems > to be a consensus that this is a desirable feature. What are your thoughts? > > The proposed module's API will be very similar to that of the bz2 module; > the only differences will be additional keyword arguments to some functions, > for specifying container formats and detailed compressor options. 
+1 for adding and +1 for keeping a similar interface. Cheers Ross From solipsis at pitrou.net Sat Aug 27 16:47:17 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 16:47:17 +0200 Subject: [Python-Dev] Software Transactional Memory for Python References: Message-ID: <20110827164717.15dbf64c@pitrou.net> On Sat, 27 Aug 2011 15:08:36 +0200 Armin Rigo wrote: > Hi Nick, > > On Sat, Aug 27, 2011 at 2:40 PM, Nick Coghlan wrote: > > 1. How does the patch interact with C code that explicitly releases > > the GIL? (e.g. IO commands inside a "with atomic:" block) > > As implemented, any code in a "with atomic" is prevented from > explicitly releasing and reacquiring the GIL: the GIL remain acquired > until the end of the "with" block. In other words > Py_BEGIN_ALLOW_THREADS has no effect in a "with" block. You then risk deadlocks. Say: - thread A is inside a "with atomic" and calls a library function which tries to take lock L - thread B has already taken lock L and is currently executing an I/O function with GIL released - thread B then waits for the GIL (and hence depends on thread A going forward), while thread A waits for lock L (and hence depends on thread B going forward) Lock L could simply be the lock used by the file object (a Buffered{Reader,Writer,Random}) which thread B is reading or writing from. Regards Antoine. From martin at v.loewis.de Sat Aug 27 16:50:02 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 27 Aug 2011 16:50:02 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: Message-ID: <4E59041A.7040100@v.loewis.de> > The implementation will also be similar to bz2 - basic compressor and > decompressor classes written in C, with convenience functions and a file > interface implemented on top of those in Python. When I reviewed lzma, I found that this approach might not be appropriate. lzma has many more options and aspects that allow tuning and selection, and a Python LZMA library should provide the same feature set as the underlying C library. So I would propose that a very thin C layer is created around the C library that focuses on the actual algorithms, and that any higher layers (in particular file formats) are done in Python. Regards, Martin From nadeem.vawda at gmail.com Sat Aug 27 16:59:21 2011 From: nadeem.vawda at gmail.com (Nadeem Vawda) Date: Sat, 27 Aug 2011 16:59:21 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: <4E59041A.7040100@v.loewis.de> References: <4E59041A.7040100@v.loewis.de> Message-ID: On Sat, Aug 27, 2011 at 4:50 PM, "Martin v. L?wis" wrote: >> The implementation will also be similar to bz2 - basic compressor and >> decompressor classes written in C, with convenience functions and a file >> interface implemented on top of those in Python. > > When I reviewed lzma, I found that this approach might not be > appropriate. lzma has many more options and aspects that allow tuning > and selection, and a Python LZMA library should provide the same feature > set as the underlying C library. > > So I would propose that a very thin C layer is created around the C > library that focuses on the actual algorithms, and that any higher > layers (in particular file formats) are done in Python. I probably shouldn't have used the word "basic" here - these classes expose all the features of the underlying library. I was rather trying to underscore that the rest of the module is implemented in terms of these two classes. 
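For a sense of the shape being described, incremental use of the two classes would look roughly like this (mirroring bz2; the names follow the proposal, but the details are not final):

    import lzma   # the proposed module

    comp = lzma.LZMACompressor()
    out = comp.compress(b"chunk one, ")
    out += comp.compress(b"chunk two")
    out += comp.flush()

    decomp = lzma.LZMADecompressor()
    assert decomp.decompress(out) == b"chunk one, chunk two"
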
As for file formats, these are handled by liblzma itself; the extension module just selects which compressor/decompressor initializer function to use depending on the value of the "format" argument. Our code won't contain anything along the lines of GzipFile; all of that work is done by the underlying C library. Rather, the LZMAFile class will be like BZ2File - just a simple filter that passes the read/written data through a LZMACompressor or LZMADecompressor as appropriate. Cheers, Nadeem From martin at v.loewis.de Sat Aug 27 17:15:09 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Sat, 27 Aug 2011 17:15:09 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> Message-ID: <4E5909FD.7060809@v.loewis.de> > As for file formats, these are handled by liblzma itself; the extension module > just selects which compressor/decompressor initializer function to use depending > on the value of the "format" argument. Our code won't contain anything along the > lines of GzipFile; all of that work is done by the underlying C library. Rather, > the LZMAFile class will be like BZ2File - just a simple filter that passes the > read/written data through a LZMACompressor or LZMADecompressor as appropriate. This is exactly what I worry about. I think adding file I/O to bz2 was a mistake, as this doesn't integrate with Python's IO library (it used to, but now after dropping stdio, they were incompatible. Indeed, for Python 3.2, BZ2File has been removed from the C module, and lifted to Python. IOW, the _lzma C module must not do any I/O, neither directly nor indirectly (through liblzma). The approach of gzip.py (doing IO and file formats in pure Python) is exactly right. Regards, Martin From ncoghlan at gmail.com Sat Aug 27 17:36:50 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 28 Aug 2011 01:36:50 +1000 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: <4E5909FD.7060809@v.loewis.de> References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> Message-ID: On Sun, Aug 28, 2011 at 1:15 AM, "Martin v. L?wis" wrote: > This is exactly what I worry about. I think adding file I/O to bz2 was a > mistake, as this doesn't integrate with Python's IO library (it used > to, but now after dropping stdio, they were incompatible. Indeed, for > Python 3.2, BZ2File has been removed from the C module, and lifted to > Python. > > IOW, the _lzma C module must not do any I/O, neither directly nor > indirectly (through liblzma). The approach of gzip.py (doing IO > and file formats in pure Python) is exactly right. PEP 399 also comes into play - we need a pure Python version for PyPy et al (or a plausible story for why an exception should be granted). Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From nadeem.vawda at gmail.com Sat Aug 27 17:37:52 2011 From: nadeem.vawda at gmail.com (Nadeem Vawda) Date: Sat, 27 Aug 2011 17:37:52 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: <4E5909FD.7060809@v.loewis.de> References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> Message-ID: On Sat, Aug 27, 2011 at 5:15 PM, "Martin v. L?wis" wrote: >> As for file formats, these are handled by liblzma itself; the extension module >> just selects which compressor/decompressor initializer function to use depending >> on the value of the "format" argument. 
Our code won't contain anything along the >> lines of GzipFile; all of that work is done by the underlying C library. Rather, >> the LZMAFile class will be like BZ2File - just a simple filter that passes the >> read/written data through a LZMACompressor or LZMADecompressor as appropriate. > > This is exactly what I worry about. I think adding file I/O to bz2 was a > mistake, as this doesn't integrate with Python's IO library (it used > to, but now after dropping stdio, they were incompatible. Indeed, for > Python 3.2, BZ2File has been removed from the C module, and lifted to > Python. > > IOW, the _lzma C module must not do any I/O, neither directly nor > indirectly (through liblzma). The approach of gzip.py (doing IO > and file formats in pure Python) is exactly right. It is not my intention for the _lzma C module to do I/O - that will be done by the LZMAFile class, which will be written in Python. My comparison with bz2 was in reference to the state of the module after it was rewritten for issue 5863. Saying "anything along the lines of GzipFile" was a bad choice of wording; what I meant is that the LZMAFile class won't handle the problem of picking apart the .xz and .lzma container formats. That is handled by liblzma (operating entirely on in-memory buffers). It will do _only_ I/O, in a similar fashion to the BZ2File class (as of changeset 2cb07a46f4b5, to avoid ambiguity ;) ). Cheers, Nadeem From martin at v.loewis.de Sat Aug 27 17:42:50 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Sat, 27 Aug 2011 17:42:50 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> Message-ID: <4E59107A.4010001@v.loewis.de> > It is not my intention for the _lzma C module to do I/O - that will be done by > the LZMAFile class, which will be written in Python. My comparison with bz2 was > in reference to the state of the module after it was rewritten for issue 5863. Ok. I'll defer my judgement then until actual code is to review. Not sure whether you already have this: supporting the tarfile module would be nice. Regards, Martin From solipsis at pitrou.net Sat Aug 27 17:40:57 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 17:40:57 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> Message-ID: <20110827174057.6c4b619e@pitrou.net> On Sun, 28 Aug 2011 01:36:50 +1000 Nick Coghlan wrote: > On Sun, Aug 28, 2011 at 1:15 AM, "Martin v. L?wis" wrote: > > This is exactly what I worry about. I think adding file I/O to bz2 was a > > mistake, as this doesn't integrate with Python's IO library (it used > > to, but now after dropping stdio, they were incompatible. Indeed, for > > Python 3.2, BZ2File has been removed from the C module, and lifted to > > Python. > > > > IOW, the _lzma C module must not do any I/O, neither directly nor > > indirectly (through liblzma). The approach of gzip.py (doing IO > > and file formats in pure Python) is exactly right. > > PEP 399 also comes into play - we need a pure Python version for PyPy > et al (or a plausible story for why an exception should be granted). The plausible story being that we basically wrap an existing library? I don't think PyPy et al have pure Python versions of the zlib or OpenSSL, do they? If we start taking PEP 399 conformance to such levels, we might as well stop developing CPython. cheers Antoine. 
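As a side note on the design the preceding messages converge on (all I/O done in Python, with the C-level codec seeing only in-memory buffers), a minimal read-only sketch of such a "thin filter" file object might look like this (simplified: no seeking, writing or error handling):

    import io

    class DecompressReader(io.RawIOBase):
        def __init__(self, fp, decompressor):
            self._fp = fp                  # underlying binary file object
            self._decomp = decompressor    # e.g. an LZMADecompressor instance
            self._buffer = b""

        def readable(self):
            return True

        def read(self, size=-1):
            # Pull compressed chunks and decompress until the request can be
            # satisfied (or end of file is reached).
            while size < 0 or len(self._buffer) < size:
                chunk = self._fp.read(8192)
                if not chunk:
                    break
                self._buffer += self._decomp.decompress(chunk)
            if size < 0:
                data, self._buffer = self._buffer, b""
            else:
                data, self._buffer = self._buffer[:size], self._buffer[size:]
            return data
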
From ncoghlan at gmail.com Sat Aug 27 17:52:51 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 28 Aug 2011 01:52:51 +1000 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: <20110827174057.6c4b619e@pitrou.net> References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> Message-ID: On Sun, Aug 28, 2011 at 1:40 AM, Antoine Pitrou wrote: > On Sun, 28 Aug 2011 01:36:50 +1000 > Nick Coghlan wrote: >> On Sun, Aug 28, 2011 at 1:15 AM, "Martin v. L?wis" wrote: >> > This is exactly what I worry about. I think adding file I/O to bz2 was a >> > mistake, as this doesn't integrate with Python's IO library (it used >> > to, but now after dropping stdio, they were incompatible. Indeed, for >> > Python 3.2, BZ2File has been removed from the C module, and lifted to >> > Python. >> > >> > IOW, the _lzma C module must not do any I/O, neither directly nor >> > indirectly (through liblzma). The approach of gzip.py (doing IO >> > and file formats in pure Python) is exactly right. >> >> PEP 399 also comes into play - we need a pure Python version for PyPy >> et al (or a plausible story for why an exception should be granted). > > The plausible story being that we basically wrap an existing library? > I don't think PyPy et al have pure Python versions of the zlib or > OpenSSL, do they? > > If we start taking PEP 399 conformance to such levels, we might as well > stop developing CPython. It's acceptable for the Python version to use ctypes in the case of wrapping an existing library, but the Python version should still exist. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From nadeem.vawda at gmail.com Sat Aug 27 17:58:11 2011 From: nadeem.vawda at gmail.com (Nadeem Vawda) Date: Sat, 27 Aug 2011 17:58:11 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> Message-ID: On Sat, Aug 27, 2011 at 5:42 PM, "Martin v. L?wis" wrote: > Not sure whether you already have this: supporting the tarfile module > would be nice. Yes, got that - issue 5689. Also of interest is issue 5411 - adding .xz support to distutils. But I think that these are separate projects that should wait until the lzma module is finalized. On Sat, Aug 27, 2011 at 5:40 PM, Antoine Pitrou wrote: > On Sun, 28 Aug 2011 01:36:50 +1000 > Nick Coghlan wrote: >> PEP 399 also comes into play - we need a pure Python version for PyPy >> et al (or a plausible story for why an exception should be granted). > > The plausible story being that we basically wrap an existing library? > I don't think PyPy et al have pure Python versions of the zlib or > OpenSSL, do they? > > If we start taking PEP 399 conformance to such levels, we might as well > stop developing CPython. Indeed, PEP 399 specifically notes that exemptions can be granted for modules that wrap external C libraries. On Sat, Aug 27, 2011 at 5:52 PM, Nick Coghlan wrote: > It's acceptable for the Python version to use ctypes in the case of > wrapping an existing library, but the Python version should still > exist. I'm not too sure about that - PEP 399 explicitly says that using ctypes is frowned upon, and doesn't mention anywhere that it should be used in this sort of situation. 
Cheers, Nadeem From ncoghlan at gmail.com Sat Aug 27 18:04:39 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 28 Aug 2011 02:04:39 +1000 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> Message-ID: On Sun, Aug 28, 2011 at 1:58 AM, Nadeem Vawda wrote: > On Sat, Aug 27, 2011 at 5:52 PM, Nick Coghlan wrote: >> It's acceptable for the Python version to use ctypes in the case of >> wrapping an existing library, but the Python version should still >> exist. > > I'm not too sure about that - PEP 399 explicitly says that using ctypes is > frowned upon, and doesn't mention anywhere that it should be used in this > sort of situation. Note to self: do not comment on python-dev at 2 am, as one's ability to read PEPs correctly apparently suffers :) Consider my comment withdrawn, you're quite right that PEP 399 actually says this is precisely the case where an exemption is a reasonable idea. Although I believe it's likely that PyPy will wrap it with ctypes anyway :) Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From arigo at tunes.org Sat Aug 27 18:14:10 2011 From: arigo at tunes.org (Armin Rigo) Date: Sat, 27 Aug 2011 18:14:10 +0200 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: References: Message-ID: Hi Antoine, > You then risk deadlocks. Say: > (...) Yes, it is indeed not a solution that co-operates transparently and deadlock-freely with regular locks. You risk the same kind of deadlocks as you would when using only locks. The reason is similar to threads that try to acquire two locks in succession. In your example: > - thread A is inside a "with atomic" and calls a library function which > tries to take lock L This is basically dangerous, because it corresponds to taking lock "GIL" and lock L, in that order, whereas the thread B takes lock L and plays around with lock "GIL" in the opposite order. I think a reasonable solution to avoid deadlocks is simply not to use explicit locks inside "with atomic" blocks. Generally speaking it can be regarded as wrong to do any action that causes an unbounded wait in a "with atomic" block, but the solution I chose to implement in my patch is to still allow them, because it doesn't make much sense to say that "print" or "pdb.set_trace()" are forbidden. A bient?t, Armin. From guido at python.org Sat Aug 27 18:19:31 2011 From: guido at python.org (Guido van Rossum) Date: Sat, 27 Aug 2011 09:19:31 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: On Fri, Aug 26, 2011 at 11:01 PM, Dan Stromberg wrote: [Steven] >> Have then been any __future__ features that were added provisionally? > > I can't either, but ISTR hearing that from __future__ import was started > with such an intent.? Irrespective, it's hard to import something from > "future" without at least suspecting that you're on the bleeding edge. No, this was not the intent of __future__. The intent is that a feature is desirable but also backwards incompatible (e.g. introduces a new keyword) so that for 1 (sometimes more) releases we require the users to use the __future__ import. There was never any intent to use __future__ for experimental features. If we want that maybe we could have from __experimental__ import . 
-- --Guido van Rossum (python.org/~guido) From drsalists at gmail.com Sat Aug 27 18:48:16 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Sat, 27 Aug 2011 09:48:16 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: On Sat, Aug 27, 2011 at 9:19 AM, Guido van Rossum wrote: > On Fri, Aug 26, 2011 at 11:01 PM, Dan Stromberg > wrote: > [Steven] > >> Have then been any __future__ features that were added provisionally? > > > > I can't either, but ISTR hearing that from __future__ import was started > > with such an intent. Irrespective, it's hard to import something from > > "future" without at least suspecting that you're on the bleeding edge. > > No, this was not the intent of __future__. The intent is that a > feature is desirable but also backwards incompatible (e.g. introduces > a new keyword) so that for 1 (sometimes more) releases we require the > users to use the __future__ import. > > There was never any intent to use __future__ for experimental > features. If we want that maybe we could have from __experimental__ > import . > > OK. So what -is- the purpose of from __future__ import? -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Sat Aug 27 18:50:40 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 18:50:40 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> Message-ID: <20110827185040.5cb3064a@pitrou.net> On Sun, 28 Aug 2011 01:52:51 +1000 Nick Coghlan wrote: > > > > The plausible story being that we basically wrap an existing library? > > I don't think PyPy et al have pure Python versions of the zlib or > > OpenSSL, do they? > > > > If we start taking PEP 399 conformance to such levels, we might as well > > stop developing CPython. > > It's acceptable for the Python version to use ctypes in the case of > wrapping an existing library, but the Python version should still > exist. I think you're taking this too seriously. Our extension modules (_bz2, _ssl...) are *already* optional even on CPython. If the library or its development headers are not available on the system, building these extensions is simply skipped, and the test suite passes nonetheless. The only required libraries for passing the tests being basically the libc and the zlib. Regards Antoine. From brian.curtin at gmail.com Sat Aug 27 18:53:13 2011 From: brian.curtin at gmail.com (Brian Curtin) Date: Sat, 27 Aug 2011 11:53:13 -0500 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: On Sat, Aug 27, 2011 at 11:48, Dan Stromberg wrote: > > No, this was not the intent of __future__. The intent is that a >> feature is desirable but also backwards incompatible (e.g. introduces >> a new keyword) so that for 1 (sometimes more) releases we require the >> users to use the __future__ import. >> >> There was never any intent to use __future__ for experimental >> features. If we want that maybe we could have from __experimental__ >> import . >> >> OK. So what -is- the purpose of from __future__ import? > It's in the first paragraph. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From martin at v.loewis.de Sat Aug 27 19:07:47 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 27 Aug 2011 19:07:47 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> Message-ID: <4E592463.8020305@v.loewis.de> >>> PEP 399 also comes into play - we need a pure Python version for PyPy >>> et al (or a plausible story for why an exception should be granted). No, we don't. We can grant an exception, which I'm very willing to do. The PEP lists wrapping a specific C-based library as a plausible reason. > It's acceptable for the Python version to use ctypes Hmm. To me, *that's* unacceptable. In the specific case, having a pure-Python implementation would be acceptable to me, but I'm skeptical that anybody is willing to produce one. Regards, Martin From neologix at free.fr Sat Aug 27 19:11:18 2011 From: neologix at free.fr (=?ISO-8859-1?Q?Charles=2DFran=E7ois_Natali?=) Date: Sat, 27 Aug 2011 19:11:18 +0200 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: References: Message-ID: Hi Armin, > This is basically dangerous, because it corresponds to taking lock > "GIL" and lock L, in that order, whereas the thread B takes lock L and > plays around with lock "GIL" in the opposite order. I think a > reasonable solution to avoid deadlocks is simply not to use explicit > locks inside "with atomic" blocks. The problem is that many locks are actually acquired implicitly. For example, `print` to a buffered stream will acquire the fileobject's mutex. Also, even if the code inside the "with atomic" block doesn't directly or indirectly acquire a lock, there's still the possibility of asynchronous code that acquires locks being executed in the middle of this block: for example, signal handlers are run on behalf of the main thread from the main eval loop and in certain other places, and the GC might kick in at any time. > Generally speaking it can be regarded as wrong to do any action that > causes an unbounded wait in a "with atomic" block, Indeed. cf From martin at v.loewis.de Sat Aug 27 19:11:58 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 27 Aug 2011 19:11:58 +0200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <20110827121012.37b39947@pitrou.net> References: <4E582432.2080301@v.loewis.de> <4E588877.3080204@v.loewis.de> <20110827121012.37b39947@pitrou.net> Message-ID: <4E59255E.6000905@v.loewis.de> Am 27.08.2011 12:10, schrieb Antoine Pitrou: > On Sat, 27 Aug 2011 08:02:31 +0200 > "Martin v. Löwis" wrote: >>> I'm not sure it's worth doing an extensive review of the code, a better >>> approach might be to require extensive test coverage (and a review of >>> tests). >> >> I think it's worth. It's really bad if only one developer fully >> understands the regex implementation. > > Could such a review be the topic of an informational PEP? Well, the reviewer would also have to dive into the code details, e.g. through Rietveld. Of course, referencing the Rietveld issue in the PEP might be appropriate. A PEP should IMO only cover end-user aspects of the new re module. Code organization is typically not in the PEP. To give a specific example: you mentioned that there is (near) code duplication in MRAB's module.
As a reviewer, I would discuss whether this can be eliminated - but not in the PEP. Regards, Martin From solipsis at pitrou.net Sat Aug 27 19:36:01 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 27 Aug 2011 19:36:01 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <20110827185040.5cb3064a@pitrou.net> Message-ID: <20110827193601.60582ee5@pitrou.net> On Sat, 27 Aug 2011 18:50:40 +0200 Antoine Pitrou wrote: > On Sun, 28 Aug 2011 01:52:51 +1000 > Nick Coghlan wrote: > > > > > > The plausible story being that we basically wrap an existing library? > > > I don't think PyPy et al have pure Python versions of the zlib or > > > OpenSSL, do they? > > > > > > If we start taking PEP 399 conformance to such levels, we might as well > > > stop developing CPython. > > > > It's acceptable for the Python version to use ctypes in the case of > > wrapping an existing library, but the Python version should still > > exist. > > I think you're taking this too seriously. Our extension modules (_bz2, > _ssl...) are *already* optional even on CPython. If the library or its > development headers are not available on the system, building these > extensions is simply skipped, and the test suite passes nonetheless. > The only required libraries for passing the tests being basically the > libc and the zlib. ...and, apparently, pyexpat... From drsalists at gmail.com Sat Aug 27 20:20:10 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Sat, 27 Aug 2011 11:20:10 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: On Sat, Aug 27, 2011 at 9:53 AM, Brian Curtin wrote: > On Sat, Aug 27, 2011 at 11:48, Dan Stromberg wrote: >> >> No, this was not the intent of __future__. The intent is that a >>> feature is desirable but also backwards incompatible (e.g. introduces >>> a new keyword) so that for 1 (sometimes more) releases we require the >>> users to use the __future__ import. >>> >>> There was never any intent to use __future__ for experimental >>> features. If we want that maybe we could have from __experimental__ >>> import . >>> >>> OK. So what -is- the purpose of from __future__ import? >> > > It's in the first paragraph. > I disagree. The first paragraph says this has something to do with new keywords. It doesn't appear to say what we expect users to -do- with it. Both are important. Is it "You'd better try this, because it's going in eventually. If you don't try it out before it becomes default behavior, you have no right to complain"? And if people do complain, what are python-dev's options? -------------- next part -------------- An HTML attachment was scrubbed... URL: From hsoft at hardcoded.net Sat Aug 27 20:33:12 2011 From: hsoft at hardcoded.net (Virgil Dupras) Date: Sat, 27 Aug 2011 14:33:12 -0400 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: On 2011-08-27, at 2:20 PM, Dan Stromberg wrote: > > On Sat, Aug 27, 2011 at 9:53 AM, Brian Curtin wrote: > On Sat, Aug 27, 2011 at 11:48, Dan Stromberg wrote: > No, this was not the intent of __future__. The intent is that a > feature is desirable but also backwards incompatible (e.g. 
introduces > a new keyword) so that for 1 (sometimes more) releases we require the > users to use the __future__ import. > > There was never any intent to use __future__ for experimental > features. If we want that maybe we could have from __experimental__ > import . > > OK. So what -is- the purpose of from __future__ import? > > It's in the first paragraph. > > I disagree. The first paragraph says this has something to do with new keywords. It doesn't appear to say what we expect users to -do- with it. Both are important. > > Is it "You'd better try this, because it's going in eventually. If you don't try it out before it becomes default behavior, you have no right to complain"? > > And if people do complain, what are python-dev's options? > __future__ imports have nothing to do with "trying stuff before it comes", it has to do with backward compatibility. For example, the "with_statement" was a __future__ import because introducing the "with" keyword would break any code using "with" as a token. I don't think that the goal of introducing "with" as a future import was "we're gonna see how it pans out, and decide if we really introduce it later". __future__ means "It's coming, prepare your code". From martin at v.loewis.de Sat Aug 27 21:05:35 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 27 Aug 2011 21:05:35 +0200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: <4E593FFF.1030203@v.loewis.de> > I disagree. The first paragraph says this has something to do with new > keywords. It doesn't appear to say what we expect users to -do- with > it. Both are important. Well, users can use the new features... > Is it "You'd better try this, because it's going in eventually. If you > don't try it out before it becomes default behavior, you have no right > to complain"? No. It's "we have that feature which will be activated in a future version. If you want to use it today, use the __future__ import. If you don't want to use it (now or in the future), just don't." > And if people do complain, what are python-dev's options? That will depend on the complaint. If it's "I don't like the new feature", then the obvious response is "don't use it, then". Regards, Martin From tjreedy at udel.edu Sat Aug 27 21:47:00 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 27 Aug 2011 15:47:00 -0400 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: Message-ID: On 8/27/2011 9:47 AM, Nadeem Vawda wrote: > I'd like to propose the addition of a new module in Python 3.3. The 'lzma' > module will provide support for compression and decompression using the LZMA > algorithm, and the .xz and .lzma file formats. The matter has already been > discussed on the tracker, where there seems > to be a consensus that this is a desirable feature. What are your thoughts? As I read the discussion, the idea has been more or less accepted in principle. However, the current patch is not and needs changes. > The proposed module's API will be very similar to that of the bz2 module; > the only differences will be additional keyword arguments to some functions, > for specifying container formats and detailed compressor options. I believe Antoine suggested a PEP. It should summarize the salient points in the long tracker discussion into a coherent exposition and flesh out the details implied above. 
(Perhaps they are already in the proposed doc addition.) > The implementation will also be similar to bz2 - basic compressor and > decompressor classes written in C, with convenience functions and a file > interface implemented on top of those in Python. I would follow Martin's suggestions, including doing all i/o with the io module and the following: "So I would propose that a very thin C layer is created around the C library that focuses on the actual algorithms, and that any higher layers (in particular file formats) are done in Python." If we minimize the C code we add and maximize what is done in Python, that would maximize the ease of porting to other implementations. This would conform to the spirit of PEP 399. -- Terry Jan Reedy From steve at pearwood.info Sat Aug 27 21:55:52 2011 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 28 Aug 2011 05:55:52 +1000 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: <4E594BC8.4060802@pearwood.info> Dan Stromberg wrote: > On Sat, Aug 27, 2011 at 9:53 AM, Brian Curtin wrote: > >> On Sat, Aug 27, 2011 at 11:48, Dan Stromberg wrote: >>> No, this was not the intent of __future__. The intent is that a >>>> feature is desirable but also backwards incompatible (e.g. introduces >>>> a new keyword) so that for 1 (sometimes more) releases we require the >>>> users to use the __future__ import. >>>> >>>> There was never any intent to use __future__ for experimental >>>> features. If we want that maybe we could have from __experimental__ >>>> import . >>>> >>>> OK. So what -is- the purpose of from __future__ import? >> It's in the first paragraph. >> > > I disagree. The first paragraph says this has something to do with new > keywords. It doesn't appear to say what we expect users to -do- with it. > Both are important. Have you read the PEP? I found it very helpful. http://www.python.org/dev/peps/pep-0236/ The motivation given in the first paragraph is pretty clear to me: __future__ is machinery added to Python to aid the transition when a backwards incompatible change is made. Perhaps it needs a note stating explicitly that it is not for trying out new features which may or may not be added at a later date. That may help prevent confusion in the, er, future. [...] > And if people do complain, what are python-dev's options? The PEP includes a question very similar to that: Q: Going back to the nested_scopes example, what if release 2.2 comes along and I still haven't changed my code? How can I keep the 2.1 behavior then? A: By continuing to use 2.1, and not moving to 2.2 until you do change your code. The purpose of future_statement is to make life easier for people who keep current with the latest release in a timely fashion. We don't hate you if you don't, but your problems are much harder to solve, and somebody with those problems will need to write a PEP addressing them. future_statement is aimed at a different audience. To me, it's quite clear: once a feature change hits __future__, it is already part of the language. It may be an optional part for at least one release, but removing it again will require the same deprecation process as removing any other language feature (see PEP 5 for more details). 
-- Steven From digitalxero at gmail.com Sat Aug 27 21:57:26 2011 From: digitalxero at gmail.com (Dj Gilcrease) Date: Sat, 27 Aug 2011 15:57:26 -0400 Subject: [Python-Dev] Add from __experimental__ import bla [was: Should we move to replace re with regex?] Message-ID: In the thread about replacing re with regex someone mentioned adding to __future__ which isnt a great idea as future APIs are already solidified, they just live there to give developer time to adapt their code. The idea of a __experimental__ area is good for any pep's or stliib additions that are somewhat controversial (API isnt agreed on, code may take a while to integrate properly, developer wants some time to hash out any edge case bugs or API clarifications that may come up in large scale testing, etc). __experimental__ should emit a warning on import that says anything in here may change or be removed at any time and should not be used in stable code. __experimental__ features should behave the same as __future__ in that they can add new keywords or semantics to the existing language __experimental__ features can move directly to the stlib or builtins if they do not add new keywords and/or are backwards compatible with the feature they are replacing. Otherwise they move into __future__ for how ever many releases are deemed reasonable time for developers to adapt their code. From drsalists at gmail.com Sat Aug 27 21:58:39 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Sat, 27 Aug 2011 12:58:39 -0700 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> Message-ID: On Sat, Aug 27, 2011 at 9:04 AM, Nick Coghlan wrote: > On Sun, Aug 28, 2011 at 1:58 AM, Nadeem Vawda > wrote: > > On Sat, Aug 27, 2011 at 5:52 PM, Nick Coghlan > wrote: > >> It's acceptable for the Python version to use ctypes in the case of > >> wrapping an existing library, but the Python version should still > >> exist. > > > > I'm not too sure about that - PEP 399 explicitly says that using ctypes > is > > frowned upon, and doesn't mention anywhere that it should be used in this > > sort of situation. > > Note to self: do not comment on python-dev at 2 am, as one's ability > to read PEPs correctly apparently suffers :) > > Consider my comment withdrawn, you're quite right that PEP 399 > actually says this is precisely the case where an exemption is a > reasonable idea. Although I believe it's likely that PyPy will wrap it > with ctypes anyway :) > I'd like to better understand why ctypes is (sometimes) frowned upon. Is it the brittleness? Tendency to segfault? If yes, is there a way of making ctypes less brittle - say, by carefully matching it against a specific version of a .so/.dll before starting to make heavy use of said .so/.dll? FWIW, I have a partial implementation of a module that does xz from Python using ctypes. It only does in-memory compression and decompression (not stream compression or decompression to or from a file), because that was all I needed for my current project, but it runs on CPython 2.x, CPython 3.x, and PyPy. I don't think it runs on Jython, but I've not looked at that carefully - my code falls back on subprocess if ctypes doesn't appear to be all there. It's at http://stromberg.dnsalias.org/svn/xz_mod/trunk/xz_mod.py -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From martin at v.loewis.de Sat Aug 27 22:21:41 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 27 Aug 2011 22:21:41 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> Message-ID: <4E5951D5.5020200@v.loewis.de> > I'd like to better understand why ctypes is (sometimes) frowned upon. > > Is it the brittleness? Tendency to segfault? That, and Python should work completely if ctypes is not available. > FWIW, I have a partial implementation of a module that does xz from > Python using ctypes. So does it work on Sparc/Solaris? On OpenBSD? On ARM-Linux? Does it work if the xz library is installed into /opt/sfw/xz? Regards, Martin From nadeem.vawda at gmail.com Sat Aug 27 22:36:52 2011 From: nadeem.vawda at gmail.com (Nadeem Vawda) Date: Sat, 27 Aug 2011 22:36:52 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> Message-ID: On Sat, Aug 27, 2011 at 9:47 PM, Terry Reedy wrote: > On 8/27/2011 9:47 AM, Nadeem Vawda wrote: >> I'd like to propose the addition of a new module in Python 3.3. The 'lzma' >> module will provide support for compression and decompression using the >> LZMA >> algorithm, and the .xz and .lzma file formats. The matter has already been >> discussed on the tracker, where there >> seems >> to be a consensus that this is a desirable feature. What are your >> thoughts? > > As I read the discussion, the idea has been more or less accepted in > principle. However, the current patch is not and needs changes. Please note that the code I'm talking about is not the same as the patches by Per ?yvind Karlsen that are attached to the tracker issue. I have been doing a completely new implementation of the module, specifically to address the concerns raised by Martin and Antoine. (As for why I haven't posted my own changes yet - I'm currently an intern at Google, and they want me to run my code by their open-source team before releasing it into the wild. Sorry for the delay and the confusion.) >> The proposed module's API will be very similar to that of the bz2 module; >> the only differences will be additional keyword arguments to some >> functions, >> for specifying container formats and detailed compressor options. > > I believe Antoine suggested a PEP. It should summarize the salient points in > the long tracker discussion into a coherent exposition and flesh out the > details implied above. (Perhaps they are already in the proposed doc > addition.) I talked to Antoine about this on IRC; he didn't seem to think a PEP would be necessary. But a summary of the discussion on the tracker issue might still be a useful thing to have, given how long it's gotten. >> The implementation will also be similar to bz2 - basic compressor and >> decompressor classes written in C, with convenience functions and a file >> interface implemented on top of those in Python. > > I would follow Martin's suggestions, including doing all i/o with the io > module and the following: > "So I would propose that a very thin C layer is created around the C > library that focuses on the actual algorithms, and that any higher > layers (in particular file formats) are done in Python." 
> > If we minimize the C code we add and maximize what is done in Python, that > would maximize the ease of porting to other implementations. This would > conform to the spirit of PEP 399. As stated in my earlier response to Martin, I intend to do this. Aside from I/O, though, there's not much that _can_ be done in Python - the rest is basically just providing a thin wrapper for the C library. On Sat, Aug 27, 2011 at 9:58 PM, Dan Stromberg wrote: > I'd like to better understand why ctypes is (sometimes) frowned upon. > > Is it the brittleness?? Tendency to segfault? The problem (as I understand it) is that ABI changes in a library will cause code that uses it via ctypes to break without warning. With an extension module, you'll get a compile failure if you rely on things that change in an incompatible way. With a ctypes wrapper, you just get incorrect answers, or segfaults. > If yes, is there a way of making ctypes less brittle - say, by > carefully matching it against a specific version of a .so/.dll before > starting to make heavy use of said .so/.dll? This might be feasible for a specific application running in a controlled environment, but it seems impractical for something as widely-used as the stdlib. Having to include a whitelist of acceptable library versions would be a substantial maintenance burden, and (compatible) new versions would not work until the library whitelist gets updated. Cheers, Nadeem From drsalists at gmail.com Sat Aug 27 22:41:07 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Sat, 27 Aug 2011 13:41:07 -0700 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: <4E5951D5.5020200@v.loewis.de> References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> Message-ID: On Sat, Aug 27, 2011 at 1:21 PM, "Martin v. L?wis" wrote: > > I'd like to better understand why ctypes is (sometimes) frowned upon. > > > > Is it the brittleness? Tendency to segfault? > > That, and Python should work completely if ctypes is not available. > What are the most major platforms ctypes doesn't work on? It seems like there should be some way of coming up with an xml file describing the types of the various bits of data and formal arguments - perhaps using gccxml or something like it. > FWIW, I have a partial implementation of a module that does xz from > > Python using ctypes. > > So does it work on Sparc/Solaris? On OpenBSD? On ARM-Linux? Does it > work if the xz library is installed into /opt/sfw/xz? > So far, I've only tried it on a couple of Linuxes and Cygwin. I intend to try it on a large number of *ix variants in the future, including OS/X and Haiku. I doubt I'll test OpenBSD, but I'm likely to test on FreeBSD and Dragonfly again. With regard to /opt/sfw/xz, if ctypes.util.find_library(library) is smart enough to look there, then yes, xz_mod should find libxz there. On Cygwin, ctypes.util.find_library() wasn't smart enough to find a Cygwin DLL, so I coded around that. But it finds the library OK on the Linuxes I've tried so far. (This is part of a larger project, a backup program. The backup program has been tested on a large number of OS's, but I've not done another broad round of testing yet since adding the ctypes+xz code) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From victor.stinner at haypocalc.com Sat Aug 27 22:54:48 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Sat, 27 Aug 2011 22:54:48 +0200 Subject: [Python-Dev] Add from __experimental__ import bla [was: Should we move to replace re with regex?] In-Reply-To: References: Message-ID: <201108272254.48635.victor.stinner@haypocalc.com> Le samedi 27 ao?t 2011 21:57:26, Dj Gilcrease a ?crit : > The idea of a __experimental__ area is good for any pep's or > stliib additions that are somewhat controversial (API isnt agreed on, > code may take a while to integrate properly, developer wants some time > to hash out any edge case bugs or API clarifications that may come up > in large scale testing, etc). __experimental__ does already exist, it's the Python Package Index (PyPI) ! http://pypi.python.org/pypi You can write Python extensions in C and distribute them on the PyPI. I did that when my patch to display the Python backtrace on a crash was "rejected" (not included in Python 3.2, just before the release). It was a great idea, because I had more time to change the API (read the history of the faulthandler module on PyPI: the API changed 5 times since the first public version on PyPI...) and the module is now available for Python 2.5 - 3.2, not only for Python 3.3. Remember that the API of a module added to CPython is frozen. You will have to wait something like 18 months until the next CPython release to change anything (add a new function, remove an old/useless function, etc.). Seriously, it's not a good idea to add a young module into Python before its API is well defined and stable. The Linux kernel has "staging" drivers. It's different because there is a new release of the Linux kernel each two months (instead of 18 months for CPython). The policy for the API is also different: the kernel has no stable API, whereas the Python API cannot be changed in minor release (x.y.Z). http://www.kroah.com/log/linux/stable_api_nonsense.html http://www.mjmwired.net/kernel/Documentation/stable_api_nonsense.txt Victor From exarkun at twistedmatrix.com Sat Aug 27 23:02:23 2011 From: exarkun at twistedmatrix.com (exarkun at twistedmatrix.com) Date: Sat, 27 Aug 2011 21:02:23 -0000 Subject: [Python-Dev] Add from __experimental__ import bla [was: Should we move to replace re with regex?] In-Reply-To: References: Message-ID: <20110827210223.1808.46364677.divmod.xquotient.81@localhost.localdomain> On 07:57 pm, digitalxero at gmail.com wrote: >In the thread about replacing re with regex someone mentioned adding >to __future__ which isnt a great idea as future APIs are already >solidified, they just live there to give developer time to adapt their >code. The idea of a __experimental__ area is good for any pep's or >stliib additions that are somewhat controversial (API isnt agreed on, >code may take a while to integrate properly, developer wants some time >to hash out any edge case bugs or API clarifications that may come up >in large scale testing, etc). > >__experimental__ should emit a warning on import that says anything in >here may change or be removed at any time and should not be used in >stable code. > >__experimental__ features should behave the same as __future__ in that >they can add new keywords or semantics to the existing language > >__experimental__ features can move directly to the stlib or builtins >if they do not add new keywords and/or are backwards compatible with >the feature they are replacing. 
Otherwise they move into __future__ >for how ever many releases are deemed reasonable time for developers >to adapt their code. Hi Dj, As a developer of Python libraries and applications, I don't see how this would make my life easier. A warning in a module docstring that a module may not be long-lived if it is not well received tells me just as much as a warning emitted at runtime. And a warning emitted at runtime is likely to scare my users into thinking something is broken, leading to spurious or misleading bug reports. There also does not appear to be general consensus that modules should be added to stdlib if they are not widely used and demanded, so I don't know when a module would be added to __experimental__, anyway. The normal deprecation procedures (rarely used as they are) seem to cover this, anyway. Adding a new namespace separate from __future__ also just gives me another thing to remember. Was the feature added to __experimental__ or __future__? Also, it seems even less common that language features are added on an experimental basis. When a language feature (new syntax or semantics) goes in to the language, it is there for a long, long time. If new features are added first to __experimental__ and then to __future__ or the non-__experimental__ stdlib namespace, then I just have to update all my code to keep using it. So I'm guaranteed extra work whether the feature is successful and is adopted or if it fails and is later removed. I'd rather not have to do the extra work in the success case, at least, which is what the existing add-it-and-then-maybe -(but-probably-not-)deprecate it approach gives me. Jean-Paul >_______________________________________________ >Python-Dev mailing list >Python-Dev at python.org >http://mail.python.org/mailman/listinfo/python-dev >Unsubscribe: http://mail.python.org/mailman/options/python- >dev/exarkun%40twistedmatrix.com From nadeem.vawda at gmail.com Sat Aug 27 23:38:43 2011 From: nadeem.vawda at gmail.com (Nadeem Vawda) Date: Sat, 27 Aug 2011 23:38:43 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> Message-ID: On Sat, Aug 27, 2011 at 10:41 PM, Dan Stromberg wrote: > It seems like there should be some way of coming up with an xml file > describing the types of the various bits of data and formal arguments - > perhaps using gccxml or something like it. The problem is that you would need to do this check at runtime, every time you load up the library - otherwise, what happens if the user upgrades their installed copy of liblzma? And we can't expect users to have the liblzma headers installed, so we'd have to try and figure out whether the library was ABI-compatible from the shared object alone; I doubt that this is even possible. From drsalists at gmail.com Sun Aug 28 00:14:15 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Sat, 27 Aug 2011 15:14:15 -0700 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> Message-ID: On Sat, Aug 27, 2011 at 2:38 PM, Nadeem Vawda wrote: > On Sat, Aug 27, 2011 at 10:41 PM, Dan Stromberg > wrote: > > It seems like there should be some way of coming up with an xml file > > describing the types of the various bits of data and formal arguments - > > perhaps using gccxml or something like it. 
> > The problem is that you would need to do this check at runtime, every time > you load up the library - otherwise, what happens if the user upgrades > their installed copy of liblzma? And we can't expect users to have the > liblzma headers installed, so we'd have to try and figure out whether the > library was ABI-compatible from the shared object alone; I doubt that this > is even possible. > I was thinking about this as I was getting groceries a bit ago. Why -can't- we expect the user to have liblzma headers installed? Couldn't it just be a dependency in the package management system? BTW, gcc-xml seems to be only for C++ (?), but long ago, around the time people were switching from K&R to Ansi C, there were programs like "mkptypes" that could parse a .c/.h and output prototypes. It seems we could do something like this on module init. IMO, we really, really need some common way of accessing C libraries that works for all major Python variants. -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Sun Aug 28 00:26:42 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 28 Aug 2011 00:26:42 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> Message-ID: <20110828002642.4765fc89@pitrou.net> On Sat, 27 Aug 2011 15:14:15 -0700 Dan Stromberg wrote: > On Sat, Aug 27, 2011 at 2:38 PM, Nadeem Vawda wrote: > > > On Sat, Aug 27, 2011 at 10:41 PM, Dan Stromberg > > wrote: > > > It seems like there should be some way of coming up with an xml file > > > describing the types of the various bits of data and formal arguments - > > > perhaps using gccxml or something like it. > > > > The problem is that you would need to do this check at runtime, every time > > you load up the library - otherwise, what happens if the user upgrades > > their installed copy of liblzma? And we can't expect users to have the > > liblzma headers installed, so we'd have to try and figure out whether the > > library was ABI-compatible from the shared object alone; I doubt that this > > is even possible. > > > > I was thinking about this as I was getting groceries a bit ago. > > Why -can't- we expect the user to have liblzma headers installed? Couldn't > it just be a dependency in the package management system? Package managers, under Linux, often split development files (headers, etc.) from runtime binaries. Also, under Windows, most users don't have development stuff installed at all. Regards Antoine. From martin at v.loewis.de Sun Aug 28 00:47:19 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 28 Aug 2011 00:47:19 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> Message-ID: <4E5973F7.30805@v.loewis.de> > Why -can't- we expect the user to have liblzma headers installed? > Couldn't it just be a dependency in the package management system? Please give it up. You just won't convince that list that ctypes is a viable approach for the standard library. 
Regards, Martin From drsalists at gmail.com Sun Aug 28 01:19:01 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Sat, 27 Aug 2011 16:19:01 -0700 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: <20110828002642.4765fc89@pitrou.net> References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> Message-ID: On Sat, Aug 27, 2011 at 3:26 PM, Antoine Pitrou wrote: > On Sat, 27 Aug 2011 15:14:15 -0700 > Dan Stromberg wrote: > > > On Sat, Aug 27, 2011 at 2:38 PM, Nadeem Vawda >wrote: > > > > > On Sat, Aug 27, 2011 at 10:41 PM, Dan Stromberg > > > wrote: > > > > It seems like there should be some way of coming up with an xml file > > > > describing the types of the various bits of data and formal arguments > - > > > > perhaps using gccxml or something like it. > > > > > > The problem is that you would need to do this check at runtime, every > time > > > you load up the library - otherwise, what happens if the user upgrades > > > their installed copy of liblzma? And we can't expect users to have the > > > liblzma headers installed, so we'd have to try and figure out whether > the > > > library was ABI-compatible from the shared object alone; I doubt that > this > > > is even possible. > > > > > > > I was thinking about this as I was getting groceries a bit ago. > > > > Why -can't- we expect the user to have liblzma headers installed? > Couldn't > > it just be a dependency in the package management system? > > Package managers, under Linux, often split development files (headers, > etc.) from runtime binaries. > Well, uhhhhh, yeah. Not sure what your point is. 1) We could easily work with the dev / nondev distinction by taking a dependency on the -dev version of whatever we need, instead of the nondev version. 2) It's a rather arbitrary distinction that's being drawn between dev and nondev today. There's no particular reason why the line couldn't be drawn somewhere else. > Also, under Windows, most users don't have development stuff installed > at all. > Yes... But if the nature of "what development stuff is" were to change, they'd have different stuff. Also, we wouldn't have to parse the .h's every time a module is loaded - we could have a timestamp file (or database) indicating when we last parsed a given .h. Also, we could query the package management system for the version of lzma that's currently installed on module init. Also, we could include our own version of lzma. Granted, this was a mess when zlib needed to be patched, but even this one might be worth it for the improved library unification across Python implementations. -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Sun Aug 28 01:27:05 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 28 Aug 2011 01:27:05 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> Message-ID: <20110828012705.523e51d4@pitrou.net> On Sat, 27 Aug 2011 16:19:01 -0700 Dan Stromberg wrote: > 2) It's a rather arbitrary distinction that's being drawn between dev and > nondev today. There's no particular reason why the line couldn't be drawn > somewhere else. Sure. Now please convince Linux distributions first, because this particular subthread is going nowhere. 
Regards Antoine. From greg.ewing at canterbury.ac.nz Sun Aug 28 01:39:10 2011 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 28 Aug 2011 11:39:10 +1200 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> Message-ID: <4E59801E.7080406@canterbury.ac.nz> Nick Coghlan wrote: > The next step needed is for someone to volunteer to write and champion > a PEP that: Would it be feasible and desirable to modify regex so that it *is* backwards-compatible with re, with a view to making it a drop-in replacement at some point? If not, the PEP should discuss this also. -- Greg From tjreedy at udel.edu Sun Aug 28 02:48:02 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 27 Aug 2011 20:48:02 -0400 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <4E59801E.7080406@canterbury.ac.nz> References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> <4E59801E.7080406@canterbury.ac.nz> Message-ID: On 8/27/2011 7:39 PM, Greg Ewing wrote: > Nick Coghlan wrote: > >> The next step needed is for someone to volunteer to write and champion >> a PEP that: > > Would it be feasible and desirable to modify regex so > that it *is* backwards-compatible with re, with a view > to making it a drop-in replacement at some point? > > If not, the PEP should discuss this also. Many of the things regex does differently might be called either bug fixes or feature changes, depending on one's viewpoint. Regex should definitely not be 'bug-compatible'. I think regex should be unicode-standard compliant as much as possible, and let the chips fall where they may. If so, it would be like the decimal module, which closely tracks the IEEE decimal standard, rather than the binary float standard. Regex is already much more compliant than re, as shown by Tom Christiansen. This is pretty obviously intentional on MB's part. It is also probably intentional that re *not* match today's Unicode TR18 specifications. These are reasons why both Ezio and I suggested on the tracker adding regex without deleting re. (I personally would not mind just replacing re with regex, but then I have no legacy re code to break. So I am not suggesting that out of respect for those who do.) -- Terry Jan Reedy From drsalists at gmail.com Sun Aug 28 03:28:20 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Sat, 27 Aug 2011 18:28:20 -0700 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: <20110828012705.523e51d4@pitrou.net> References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: On Sat, Aug 27, 2011 at 4:27 PM, Antoine Pitrou wrote: > > Sure. Now please convince Linux distributions first, because this > particular subthread is going nowhere. > I hope you're not a solipsist. Anyway, if the mere -discussion- of embracing a standard and safe way of making C libraries callable from all the major Python implementations is "going nowhere" before the discussion has even gotten started, I fear for Python's future. Repeat aloud to yourself: Python != CPython. Python != CPython. Python != CPython. Has this topic been discussed to death? If so, then say so. 
It's rude to try to kill the thread summarily before it gets started, sans discussion, sans explanation, sans commentary on whether new additions to the topic have surfaced or not. -------------- next part -------------- An HTML attachment was scrubbed... URL: From drsalists at gmail.com Sun Aug 28 03:33:10 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Sat, 27 Aug 2011 18:33:10 -0700 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: <20110828012705.523e51d4@pitrou.net> References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: On Sat, Aug 27, 2011 at 4:27 PM, Antoine Pitrou wrote: > On Sat, 27 Aug 2011 16:19:01 -0700 > Dan Stromberg wrote: > > 2) It's a rather arbitrary distinction that's being drawn between dev and > > nondev today. There's no particular reason why the line couldn't be > drawn > > somewhere else. > > Sure. Now please convince Linux distributions first, because this > particular subthread is going nowhere. > Interesting. You seem to want to throw an arbitrary barrier between Python, the language, and accomplishing something important for said language. Care to tell me why I'm wrong? I'm all ears. I'll note that you've deleted: > 1) We could easily work with the dev / nondev distinction by > taking a dependency on the -dev version of whatever we need, > instead of the nondev version. ...which makes it more than apparent that we needn't convince Linux distributors of #2, which you seem to prefer to focus on. Why was it in your best interest to delete #1, without even commenting on it? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ezio.melotti at gmail.com Sun Aug 28 05:19:15 2011 From: ezio.melotti at gmail.com (Ezio Melotti) Date: Sun, 28 Aug 2011 06:19:15 +0300 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> <4E59801E.7080406@canterbury.ac.nz> Message-ID: On Sun, Aug 28, 2011 at 3:48 AM, Terry Reedy wrote: > > These are reasons why both Ezio and I suggested on the tracker adding regex > without deleting re. (I personally would not mind just replacing re with > regex, but then I have no legacy re code to break. So I am not suggesting > that out of respect for those who do.) > I would actually prefer to replace re. Before doing that we should make a list of all the differences between the two modules (possibly in the PEP). On the regex page on PyPI there's already a list that can be used for this purpose [0]. For bug fixes it *shouldn't* be a problem if the behavior changes. New features shouldn't bring any backward-incompatible behavioral changes, and, as far as I understand, Matthew introduced the NEW flag [1], to avoid problems when they do. I think re should be kept around only if there are too many incompatibilities left and if they can't be fixed in regex. Best Regards, Ezio Melotti [0]: http://pypi.python.org/pypi/regex/0.1.20110717 [1]: "The NEW flag turns on the new behaviour of this module, which can differ from that of the 're' module, such as splitting on zero-width matches, inline flags affecting only what follows, and being able to turn inline flags off." -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From guido at python.org Sun Aug 28 05:54:13 2011 From: guido at python.org (Guido van Rossum) Date: Sat, 27 Aug 2011 20:54:13 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> <4E59801E.7080406@canterbury.ac.nz> Message-ID: On Sat, Aug 27, 2011 at 5:48 PM, Terry Reedy wrote: > Many of the things regex does differently might be called either bug fixes > or feature changes, depending on one's viewpoint. Regex should definitely > not be 'bug-compatible'. Well, as you said, it depends on one's viewpoint. If there's a bug in the treatment of non-BMP character ranges, that's a bug, and fixing it shouldn't break anybody's code (unless it was worth breaking :-). But if there's a change that e.g. (hypothetical example) makes a different choice about how empty matches are treated in some edge case, and the old behavior was properly documented, that's a feature change, and I'd rather introduce a flag to select the new behavior (or, if we have to, a flag to preserve the old behavior, if the new behavior is really considered much better and much more useful). > I think regex should be unicode-standard compliant as much as possible, and > let the chips fall where they may. In most cases the Unicode improvements in regex are not where it is incompatible; e.g. adding \X and named ranges are fine new additions and IIUC the syntax was carefully designed not to introduce any incompatibilities (within the limitations of \-escapes). It's the many other "improvements" to the regex module that sometimes make it incompatible.There's a comprehensive list here: http://pypi.python.org/pypi/regex . Somebody should just go over it and for each difference make a recommendation for whether to treat this as a bugfix, a compatible new feature, or an incompatibility that requires some kind of flag. (We could have a single flag for all incompatibilities, or several flags.) > If so, it would be like the decimal > module, which closely tracks the IEEE decimal standard, rather than the > binary float standard. Well, I would hope that for each "major" Python version (i.e. 3.2, 3.3, 3.4, ...) we would pick a specific version of the Unicode standard and declare our desire to be compliant with that Unicode standard version, and not switch allegiances in some bugfix version (e.g. 3.2.3, 3.3.1, ...). > Regex is already much more compliant than re, as shown by Tom Christiansen. Nobody disagrees with this or thinks it's a bad thing. :-) > This is pretty obviously intentional on MB's part. That's also clear. > It is also probably intentional that re *not* match today's Unicode > TR18 specifications. That I'm not so sure of. I think it's more the case that TR18 evolved and that the re modules didn't -- probably mostly because nobody had the time and nobody was aware of the TR18 changes. > These are reasons why both Ezio and I suggested on the tracker adding regex > without deleting re. (I personally would not mind just replacing re with > regex, but then I have no legacy re code to break. So I am not suggesting > that out of respect for those who do.) That option is definitely still on the table. At the very least a thorough review of the stated differences between re and regex should be done -- I trust that MR has been very thorough in his listing of those differences. 
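To illustrate the kind of mechanical aid such a review could use -- a sketch only, assuming nothing beyond the top-level API that regex deliberately shares with re (the corpus of cases below is made up for illustration):

    import re, regex

    cases = [(r'\bfoo\b', 'foo bar'), (r'[a-z]+', 'Hello World')]
    for pat, text in cases:
        old = re.findall(pat, text)
        new = regex.findall(pat, text)
        if old != new:
            print('divergence for %r: re=%r regex=%r' % (pat, old, new))

Each divergence found that way still has to be classified by hand as a bug fix, a compatible new feature, or an incompatibility that needs a flag.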
The issues regarding maintenance and stability of MR's code can be solved in a number of ways -- if MR doesn't mind I would certainly be willing to give him core committer access (though I'd still recommend that he use his time primarily to train others in maintaining this important code base). -- --Guido van Rossum (python.org/~guido) From guido at python.org Sun Aug 28 05:57:21 2011 From: guido at python.org (Guido van Rossum) Date: Sat, 27 Aug 2011 20:57:21 -0700 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> Message-ID: On Sat, Aug 27, 2011 at 3:14 PM, Dan Stromberg wrote: > IMO, we really, really need some common way of accessing C libraries that > works for all major Python variants. We have one. It's called writing an extension module. ctypes is a crutch because it doesn't realistically have access to the header files. It's a fine crutch for PyPy, which doesn't have much of an alternative. It's also a fine crutch for people who need something to run *now*. It's a horrible strategy for the standard library. If you have a better proposal please do write it up. But so far you are mostly exposing your ignorance and insisting dramatically that you be educated. -- --Guido van Rossum (python.org/~guido) From ezio.melotti at gmail.com Sun Aug 28 05:59:59 2011 From: ezio.melotti at gmail.com (Ezio Melotti) Date: Sun, 28 Aug 2011 06:59:59 +0300 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <20110827035602.557f772f@pitrou.net> References: <4E582432.2080301@v.loewis.de> <20110827035602.557f772f@pitrou.net> Message-ID: On Sat, Aug 27, 2011 at 4:56 AM, Antoine Pitrou wrote: > On Sat, 27 Aug 2011 04:37:21 +0300 > Ezio Melotti wrote: > > > > I'm not sure it's worth doing an extensive review of the code, a better > > approach might be to require extensive test coverage (and a review of > > tests). If the code seems well written, commented, documented (I think > > proper rst documentation is still missing), > > Isn't this precisely what a review is supposed to assess? > This can be done without actually knowing and understanding every single function in the module (I got the impression that someone wants this kind of review, correct me if I'm wrong). > > > We will get familiar with the code once we start contributing > > to it and fixing bugs, as it already happens with most of the other > modules. > > I'm not sure it's a good idea for a module with more than 10000 lines > of C code (and 4000 lines of pure Python code). This is several times > the size of multiprocessing. The C code looks very cleanly written, but > it's still a big chunk of algorithmically sophisticated code. > Even unicodeobject.c is 10k+ lines of C code and I got familiar with (parts of) it just by fixing bugs in specific functions. I took a look at the regex code and it seems clear, with enough comments and several small functions that are easy to follow and understand. multiprocessing requires good knowledge of a number of concepts and platform-specific issues that makes it more difficult to understand and maintain (but maybe regex-related concepts seems easier to me because I'm already familiar with them). I think it would be good to: 1) have some document that explains the general design and main (internal) functions of the module (e.g. 
a PEP); 2) make a review on rietveld (possibly only of the diff with re, to limit the review to the new code only), so that people can ask questions, discuss and understand the code; 3) possibly update the document/PEP with the outcome of the rietveld review(s) and/or address the issues discussed (if any); 4) add documentation for the module and the (public) functions in Doc/library (this should be done anyway). This will ensure that the general quality of the code is good, and when someone actually has to work on the code, there's enough documentation to make it possible. Best Regards, Ezio Melotti > > Another "interesting" question is whether it's easy to port to the PEP > 393 string representation, if it gets accepted. > > Regards > > Antoine. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Sun Aug 28 06:28:17 2011 From: guido at python.org (Guido van Rossum) Date: Sat, 27 Aug 2011 21:28:17 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <4E582432.2080301@v.loewis.de> <20110827035602.557f772f@pitrou.net> Message-ID: On Sat, Aug 27, 2011 at 8:59 PM, Ezio Melotti wrote: > On Sat, Aug 27, 2011 at 4:56 AM, Antoine Pitrou wrote: >> >> On Sat, 27 Aug 2011 04:37:21 +0300 >> Ezio Melotti wrote: >> > >> > I'm not sure it's worth doing an extensive review of the code, a better >> > approach might be to require extensive test coverage ?(and a review of >> > tests). ?If the code seems well written, commented, documented (I think >> > proper rst documentation is still missing), >> >> Isn't this precisely what a review is supposed to assess? > > This can be done without actually knowing and understanding every single > function in the module (I got the impression that someone wants this kind of > review, correct me if I'm wrong). Wasn't me. I've long given up expecting to understand every line of code in CPython. I'm happy if the code is written in a way that makes it possible to read and understand it as the need arises. >> > We will get familiar with the code once we start contributing >> > to it and fixing bugs, as it already happens with most of the other >> > modules. >> >> I'm not sure it's a good idea for a module with more than 10000 lines >> of C code (and 4000 lines of pure Python code). This is several times >> the size of multiprocessing. The C code looks very cleanly written, but >> it's still a big chunk of algorithmically sophisticated code. > > Even unicodeobject.c is 10k+ lines of C code and I got familiar with (parts > of) it just by fixing bugs in specific functions. > I took a look at the regex code and it seems clear, with enough comments and > several small functions that are easy to follow and understand. > multiprocessing requires good knowledge of a number of concepts and > platform-specific issues that makes it more difficult to understand and > maintain (but maybe regex-related concepts seems easier to me because I'm > already familiar with them). Are you volunteering? (Even if you don't want to be the only maintainer, it still sounds like you'd be a good co-maintainer of the regex module.) > I think it would be good to: > ? 1) have some document that explains the general design and main (internal) > functions of the module (e.g. a PEP); I don't think that such a document needs to be a PEP; PEPs are usually intended where there is significant discussion expected, not just to explain things. 
A README file or a Wiki page would be fine, as long as it's sufficiently comprehensive. > ? 2) make a review on rietveld (possibly only of the diff with re, to limit > the review to the new code only), so that people can ask questions, discuss > and understand the code; That would be an interesting exercise indeed. > ? 3) possibly update the document/PEP with the outcome of the rietveld > review(s) and/or address the issues discussed (if any); Yeah, of course. > ? 4) add documentation for the module and the (public) functions in > Doc/library (this should be done anyway). Does regex have a significany public C interface? (_sre.c doesn't.) Does it have a Python-level interface beyond what re.py offers (apart from the obvious new flags and new regex syntax/semantics)? > This will ensure that the general quality of the code is good, and when > someone actually has to work on the code, there's enough documentation to > make it possible. That sounds like a good description of a process that could lead to acceptance of regex as a re replacement. >> Another "interesting" question is whether it's easy to port to the PEP >> 393 string representation, if it gets accepted. It's very likely that PEP 393 is accepted. So likely, in fact, that I would recommend that you start porting regex to PEP 393 now. The experience would benefit both your understanding of the regex module and the quality of the PEP and its implementation. I like what I hear here! -- --Guido van Rossum (python.org/~guido) From tjreedy at udel.edu Sun Aug 28 06:58:35 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Sun, 28 Aug 2011 00:58:35 -0400 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: Dan, I once had the more or less the same opinion/question as you with regard to ctypes, but I now see at least 3 problems. 1) It seems hard to write it correctly. There are currently 47 open ctypes issues, with 9 being feature requests, leaving 38 behavior-related issues. Tom Heller has not been able to work on it since the beginning of 2010 and has formally withdrawn as maintainer. No one else that I know of has taken his place. 2) It is not trivial to use it correctly. I think it needs a SWIG-like companion script that can write at least first-pass ctypes code from the .h header files. Or maybe it could/should use header info at runtime (with the .h bundled with a module). 3) It seems to be slower than compiled C extension wrappers. That, at least, was the discovery of someone who re-wrote pygame using ctypes. (The hope was that using ctypes would aid porting to 3.x, but the time penalty was apparently too much for time-critical code.) If you want to see more use of ctypes in the Python community (though not necessarily immediately in the stdlib), feel free to work on any one of these problems. A fourth problem is that people capable of working on ctypes are also capable of writing C extensions, and most prefer that. Or some work on Cython, which is a third solution. -- Terry Jan Reedy From tjreedy at udel.edu Sun Aug 28 07:27:39 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Sun, 28 Aug 2011 01:27:39 -0400 Subject: [Python-Dev] Should we move to replace re with regex? 
In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> <4E59801E.7080406@canterbury.ac.nz> Message-ID: On 8/27/2011 11:54 PM, Guido van Rossum wrote: >> If so, it would be like the decimal >> module, which closely tracks the IEEE decimal standard, rather than the >> binary float standard. > > Well, I would hope that for each "major" Python version (i.e. 3.2, > 3.3, 3.4, ...) we would pick a specific version of the Unicode > standard and declare our desire to be compliant with that Unicode > standard version, and not switch allegiances in some bugfix version > (e.g. 3.2.3, 3.3.1, ...). Definitely. The unicode version would have to be frozen with beta 1 if not before. (I am quite sure the decimal module also freezes the IEEE standard version *it* follows for each Python version.) In my view, x.y is a version of the Python language while the x.y.z CPython releases are progressively better implementations of that one language, starting with x.y.0. This is the main reason I suggested that the first CPython release for the 3.3 language be called 3.3.0, as it now is. In this view, there is no question of an x.y.z+1 release changing the definition of the x.y language. -- Terry Jan Reedy From drsalists at gmail.com Sun Aug 28 07:36:41 2011 From: drsalists at gmail.com (Dan Stromberg) Date: Sat, 27 Aug 2011 22:36:41 -0700 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> Message-ID: On Sat, Aug 27, 2011 at 8:57 PM, Guido van Rossum wrote: > On Sat, Aug 27, 2011 at 3:14 PM, Dan Stromberg > wrote: > > IMO, we really, really need some common way of accessing C libraries that > > works for all major Python variants. > > We have one. It's called writing an extension module. > And yet Cext's are full of CPython-isms. I've said in the past that Python has been lucky in that it had only a single implementation for a long time, but still managed to escape becoming too defined by the idiosyncrasies of that implementation - that's quite impressive, and is probably our best indication that Python has had leadership with foresight. In the language proper, I'd say I still believe this, but Cext's are sadly not a good example. > ctypes is a crutch because it doesn't realistically have access to the > header files. Well, actually, header files are pretty easy to come by. I bet you've installed them yourself many times. In fact, you've probably even automatically brought some of them in via a package management system of one form or another without getting your hands dirty. As a thought experiment, imagine having a ctypes configuration system that looks around a computer for .h's and .so's (etc) with even 25% of the effort expended by GNU autoconf. Instead of building the results into a bunch of .o's, the results are saved in a .ct file or something. If you build-in some reasonable default locations to look in, plus the equivalent of some -I's and -L's (and maybe -rpath's) as needed, you probably end up with a pretty comparable system. (typedef's might be a harder problem - that's particularly worth discussing, IMO - your chance to nip this in the bud with a reasoned explanation why they can't be handled well!) It's a fine crutch for PyPy, which doesn't have much of > an alternative. 
Wait - a second ago I thought I was to believe that C extension modules were the one true way of interfacing with C code across all major implementations? Are we perhaps saying that CPython is "the" major implementation, and that we want it to stay that way? I personally feel that PyPy has arrived as a major implementation. The backup program I've been writing in my spare time runs great on PyPy (and the CPython's from 2.5.x, and pretty well on Jython). And PyPy has been maturing very rapidly ('just wish they'd do 3.x!). It's also a fine crutch for people who need something > to run *now*. It's a horrible strategy for the standard library. > I guess I'm coming to see this as dogma. If ctypes is augmented with type information and/or version information and where to find things, wouldn't it Become safe and convenient? Or do you have other concerns? Make a list of things that can go wrong with ctypes modules. Now make a list of things that can wrong with C extension modules. Aren't they really pretty similar - missing .so, .so in a weird place, and especially: .so with a changed interface? C really isn't a very safe language - not like http://en.wikipedia.org/wiki/Turing_%28programming_language%29 or something. Perhaps it's a little easier to mess things up with ctypes today (a recompile doesn't fix, or at least detect, as many problems), but isn't it at least worth Thinking about how that situation could be improved? If you have a better proposal please do write it up. But so far you > are mostly exposing your ignorance and insisting dramatically that you > be educated. > I'm not sure why you're trying to avoid having a discussion. I think it's premature to dive into a proposal before getting other people's thoughts. Frankly, 100 people tend to think better than one - at least, if the 100 people feel like they can talk. I'm -not- convinced ctypes are the way forward. I just want to talk about it - for now. ctypes have some significant advantages - if we can find a way to eliminate and/or ameliorate their disadvantages, they might be quite a bit nicer than Cext's. -------------- next part -------------- An HTML attachment was scrubbed... URL: From martin at v.loewis.de Sun Aug 28 07:58:02 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 28 Aug 2011 07:58:02 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> Message-ID: <4E59D8EA.4080306@v.loewis.de> > I just want to talk about it - for now. python-ideas is a better place to just talk than python-dev. Regards, Martin From stefan_ml at behnel.de Sun Aug 28 08:50:21 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 28 Aug 2011 08:50:21 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> Message-ID: Dan Stromberg, 27.08.2011 21:58: > On Sat, Aug 27, 2011 at 9:04 AM, Nick Coghlan wrote: >> On Sun, Aug 28, 2011 at 1:58 AM, Nadeem Vawda wrote: >>> On Sat, Aug 27, 2011 at 5:52 PM, Nick Coghlan wrote: >>>> It's acceptable for the Python version to use ctypes in the case of >>>> wrapping an existing library, but the Python version should still >>>> exist. 
>>> >>> I'm not too sure about that - PEP 399 explicitly says that using ctypes >>> is >>> frowned upon, and doesn't mention anywhere that it should be used in this >>> sort of situation. >> >> Note to self: do not comment on python-dev at 2 am, as one's ability >> to read PEPs correctly apparently suffers :) >> >> Consider my comment withdrawn, you're quite right that PEP 399 >> actually says this is precisely the case where an exemption is a >> reasonable idea. Although I believe it's likely that PyPy will wrap it >> with ctypes anyway :) > > I'd like to better understand why ctypes is (sometimes) frowned upon. > > Is it the brittleness? Tendency to segfault? Maybe unwieldy code and slow execution on CPython? Note that there's a ctypes backend for Cython being written as part of a GSoC, so it should eventually become possible to write C library wrappers in Cython and have it generate a ctypes version to run on PyPy. That, together with the IronPython backend that is on its way, would give you a way to write fast wrappers for at least three of the major four Python implementations, without sacrificing readability or speed in one of them. Stefan From ncoghlan at gmail.com Sun Aug 28 09:57:24 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 28 Aug 2011 17:57:24 +1000 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <4E582432.2080301@v.loewis.de> <20110827035602.557f772f@pitrou.net> Message-ID: On Sun, Aug 28, 2011 at 2:28 PM, Guido van Rossum wrote: > On Sat, Aug 27, 2011 at 8:59 PM, Ezio Melotti wrote: >> I think it would be good to: >> ? 1) have some document that explains the general design and main (internal) >> functions of the module (e.g. a PEP); > > I don't think that such a document needs to be a PEP; PEPs are usually > intended where there is significant discussion expected, not just to > explain things. A README file or a Wiki page would be fine, as long as > it's sufficiently comprehensive. timsort.txt and dictnotes.txt may be useful precedents for the kind of thing that is useful on that front. IIRC, the pymalloc stuff has a massive embedded comment, which can also work. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From hodgestar+pythondev at gmail.com Sun Aug 28 13:40:43 2011 From: hodgestar+pythondev at gmail.com (Simon Cross) Date: Sun, 28 Aug 2011 13:40:43 +0200 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: On Sun, Aug 28, 2011 at 6:58 AM, Terry Reedy wrote: > 2) It is not trivial to use it correctly. I think it needs a SWIG-like > companion script that can write at least first-pass ctypes code from the .h > header files. Or maybe it could/should use header info at runtime (with the > .h bundled with a module). This is sort of already available: -- http://starship.python.net/crew/theller/ctypes/old/codegen.html -- http://svn.python.org/projects/ctypes/trunk/ctypeslib/ It just appears to have never made it into CPython. I've used it successfully on a small project. 
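For anyone who has not looked at ctypes-level wrappers before, the kind of code such a generator emits (and that you would otherwise write by hand) boils down to something like the following sketch -- the library and function are picked purely for illustration (libm on a Unix-like system), not taken from actual ctypeslib output:

    import ctypes
    import ctypes.util

    # Locate and load the C math library (on Windows the math functions
    # live in the C runtime instead, so this line would differ there).
    libm = ctypes.CDLL(ctypes.util.find_library("m"))

    # Declare the prototype so ctypes converts arguments and the result
    # correctly instead of guessing.
    libm.pow.argtypes = [ctypes.c_double, ctypes.c_double]
    libm.pow.restype = ctypes.c_double

    print(libm.pow(2.0, 10.0))    # 1024.0

The generator's job is really just to mass-produce those argtypes/restype declarations from the header; the loading and calling side stays the same.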
Schiavo Simon From guido at python.org Sun Aug 28 18:43:33 2011 From: guido at python.org (Guido van Rossum) Date: Sun, 28 Aug 2011 09:43:33 -0700 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: References: Message-ID: On Sat, Aug 27, 2011 at 6:08 AM, Armin Rigo wrote: > Hi Nick, > > On Sat, Aug 27, 2011 at 2:40 PM, Nick Coghlan wrote: >> 1. How does the patch interact with C code that explicitly releases >> the GIL? (e.g. IO commands inside a "with atomic:" block) > > As implemented, any code in a "with atomic" is prevented from > explicitly releasing and reacquiring the GIL: the GIL remain acquired > until the end of the "with" block. ?In other words > Py_BEGIN_ALLOW_THREADS has no effect in a "with" block. ?This gives > semantics that, in a full multi-core STM world, would be implementable > by saying that if, in the middle of a transaction, you need to do I/O, > then from this point onwards the transaction is not allowed to abort > any more. ?Such "inevitable" transactions are already supported e.g. > by RSTM, the C++ framework I used to prototype a C version > (https://bitbucket.org/arigo/arigo/raw/default/hack/stm/c ). > >> 2. Whether or not Jython and IronPython could implement something like >> that, since they're free threaded with fine-grained locks. If they >> can't then I don't see how we could justify making it part of the >> standard library. > > Yes, I can imagine some solutions. ?I am no Jython or IronPython > expert, but let us assume that they have a way to check synchronously > for external events from time to time (i.e. if there is some > equivalent to sys.setcheckinterval()). ?If they do, then all you need > is the right synchronization: the thread that wants to start a "with > atomic" has to wait until all other threads are paused in the external > check code. ?(Again, like CPython's, this not a properly multi-core > STM-ish solution, but it would give the right semantics. ?(And if it > turns out that STM is successful in the future, Java will grow more > direct support for it )) > > > A bient?t, > > Armin. This sounds like a very interesting idea to pursue, even if it's late, and even if it's experimental, and even if it's possible to cause deadlocks (no news there). I propose that we offer a C API in Python 3.3 as well as an extension module that offers the proposed decorator. The C API could then be used to implement alternative APIs purely as extension modules (e.g. would a deadlock-detecting API be possible?). I don't think this needs a PEP, it's not a very pervasive change. We can even document the API as experimental. But (if I may trust Armin's reasoning) it's important to add support directly to CPython, as currently it cannot be done as a pure extension module. -- --Guido van Rossum (python.org/~guido) From guido at python.org Sun Aug 28 18:53:00 2011 From: guido at python.org (Guido van Rossum) Date: Sun, 28 Aug 2011 09:53:00 -0700 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> <20110827035916.583c3d81@pitrou.net> <4E5868D6.8090203@pearwood.info> <4E59801E.7080406@canterbury.ac.nz> <20110828075246.GG99611@nexus.in-nomine.org> Message-ID: Someone asked me off-line what I wanted besides talk. Here's the list I came up with: You could try for instance volunteer to do a thorough code review of the regex code, trying to think of ways to break it (e.g. bad syntax or extreme use of nesting etc., or bad data). 
Or you could volunteer to maintain it in the future. Or you could try to port it to PEP 393. Or you could systematically go over the given list of differences between re and regex and decide whether they are likely to be backwards incompatibilities that will break existing code. Or you could try to add some of the functionality requested by Tom C in one of his several bugs. -- --Guido van Rossum (python.org/~guido) From guido at python.org Sun Aug 28 19:12:56 2011 From: guido at python.org (Guido van Rossum) Date: Sun, 28 Aug 2011 10:12:56 -0700 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> Message-ID: On Sat, Aug 27, 2011 at 10:36 PM, Dan Stromberg wrote: > > On Sat, Aug 27, 2011 at 8:57 PM, Guido van Rossum wrote: >> >> On Sat, Aug 27, 2011 at 3:14 PM, Dan Stromberg >> wrote: >> > IMO, we really, really need some common way of accessing C libraries >> > that >> > works for all major Python variants. >> >> We have one. It's called writing an extension module. > > And yet Cext's are full of CPython-isms. I have to apologize, I somehow misread your "all Python variants" as a mixture of "all CPython versions" and "all platforms where CPython runs". While I have no desire to continue this discussion, you are most welcome to do so. -- --Guido van Rossum (python.org/~guido) From martin at v.loewis.de Sun Aug 28 20:13:03 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 28 Aug 2011 20:13:03 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <4E577589.4030809@v.loewis.de> Message-ID: <4E5A852F.7040206@v.loewis.de> Am 26.08.2011 16:56, schrieb Guido van Rossum: > Also, please add the table (and the reasoning that led to it) to the PEP. Done! Martin From stefan_ml at behnel.de Sun Aug 28 20:23:35 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 28 Aug 2011 20:23:35 +0200 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: Hi, sorry for hooking in here with my usual Cython bias and promotion. When the question comes up what a good FFI for Python should look like, it's an obvious reaction from my part to throw Cython into the game. Terry Reedy, 28.08.2011 06:58: > Dan, I once had the more or less the same opinion/question as you with > regard to ctypes, but I now see at least 3 problems. > > 1) It seems hard to write it correctly. There are currently 47 open ctypes > issues, with 9 being feature requests, leaving 38 behavior-related issues. > Tom Heller has not been able to work on it since the beginning of 2010 and > has formally withdrawn as maintainer. No one else that I know of has taken > his place. Cython has an active set of developers and a rather large and growing user base. It certainly has lots of open issues in its bug tracker, but most of them are there because we *know* where the development needs to go, not so much because we don't know how to get there. After all, the semantics of Python and C/C++, between which Cython sits, are pretty much established. 
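For readers who have never seen Cython source, a complete "thin wrapper" can be as small as the following made-up example (the wrapped function is arbitrary; the point is only that the syntax is essentially Python plus C declarations):

    # sketch of a minimal .pyx module, not taken from any real project
    cdef extern from "math.h":
        double sqrt(double x)

    def py_sqrt(double x):
        # calls the C function directly; Cython generates the glue code
        return sqrt(x)

Compiling that produces an ordinary extension module exposing py_sqrt().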
Cython compiles to C code for CPython, (hopefully soon [1]) to Python+ctypes for PyPy and (mostly [2]) C++/CLI code for IronPython, which boils down to the same build time and runtime kind of dependencies that the supported Python runtimes have anyway. It does not add dependencies on any external libraries by itself, such as the libffi in CPython's ctypes implementation. For the CPython backend, the generated code is very portable and is self-contained when compiled against the CPython runtime (plus, obviously, libraries that the user code explicitly uses). It generates efficient code for all existing CPython versions starting with Python 2.4, with several optimisations also for recent CPython versions (including the upcoming 3.3). > 2) It is not trivial to use it correctly. Cython is basically Python, so Python developers with some C or C++ knowledge tend to get along with it quickly. I can't say yet how easy it is (or will be) to write code that is portable across independent Python implementations, but given that that field is still young, there's certainly a lot that can be done to aid this. > I think it needs a SWIG-like > companion script that can write at least first-pass ctypes code from the .h > header files. Or maybe it could/should use header info at runtime (with the > .h bundled with a module). From my experience, this is a "nice to have" more than a requirement. It has been requested for Cython a couple of times, especially by new users, and there are a couple of scripts out there that do this to some extent. But the usual problem is that Cython users (and, similarly, ctypes users) do not want a 1:1 mapping of a library API to a Python API (there's SWIG for that), and you can't easily get more than a trivial mapping out of a script. But, yes, a one-shot generator for the necessary declarations would at least help in cases where the API to be wrapped is somewhat large. > 3) It seems to be slower than compiled C extension wrappers. That, at > least, was the discovery of someone who re-wrote pygame using ctypes. (The > hope was that using ctypes would aid porting to 3.x, but the time penalty > was apparently too much for time-critical code.) Cython code can be as fast as C code, and in some cases, especially when developer time is limited, even faster than hand written C extensions. It allows for a straight forward optimisation path from regular Python code down to the speed of C, and trivial interaction with C code itself, if the need arises. Stefan [1] The PyPy port of Cython is currently being written as a GSoC project. [2] The IronPython port of Cython was written to facility a NumPy port to the .NET environment. It's currently not a complete port of all Cython features. From solipsis at pitrou.net Sun Aug 28 21:07:54 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 28 Aug 2011 21:07:54 +0200 Subject: [Python-Dev] peps: Add memory consumption table. References: Message-ID: <20110828210754.4bec2e92@pitrou.net> On Sun, 28 Aug 2011 20:13:11 +0200 martin.v.loewis wrote: > > +Performance > +----------- > + > +Performance of this patch must be considered for both memory > +consumption and runtime efficiency. For memory consumption, the > +expectation is that applications that have many large strings will see > +a reduction in memory usage. For small strings, the effects depend on > +the pointer size of the system, and the size of the Py_UNICODE/wchar_t > +type. The following table demonstrates this for various small string > +sizes and platforms. 
The table is for ASCII-only strings, right? Perhaps that should be mentioned somewhere. Regards Antoine. From martin at v.loewis.de Sun Aug 28 21:47:05 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 28 Aug 2011 21:47:05 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <20110825132734.1c236d17@pitrou.net> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> Message-ID: <4E5A9B39.8090009@v.loewis.de> > I would say no more than a 15% slowdown on each of the following > benchmarks: > > - stringbench.py -u > (http://svn.python.org/view/sandbox/trunk/stringbench/) > - iobench.py -t > (in Tools/iobench/) > - the json_dump, json_load and regex_v8 tests from > http://hg.python.org/benchmarks/ I now have benchmark results for these; numbers are for revision c10bcab2aac7, comparing to 1ea72da11724 (wide unicode), on 64-bit Linux with gcc 4.6.1 running on Core i7 2.8GHz. - stringbench gives 10% slowdown on total time; the tests take between 78% and 220%. The cost is typically not in performing the string operations themselves, but in the creation of the result strings. In PEP 393, a buffer must be scanned for the highest code point, which means that each byte must be inspected twice (a second time when the copying occurs). - the iobench results are between 2% acceleration (seek operations), 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed difference is probably in the UTF-8 decoder; I have already restored the "runs of ASCII" optimization and am out of ideas for further speedups. Again, having to scan the UTF-8 string twice is probably one cause of slowdown. - the json and regex_v8 tests see a slowdown of below 1%. The slowdown is larger when compared with a narrow Unicode build. > Additionally, it would be nice if you could run at least some of the > test_bigmem tests, according to your system's available RAM. Running only StrTest with 4.5G allows me to run 2 tests (test_encode_raw_unicode_escape and test_encode_utf7); this sees a slowdown of 37% in Linux user time. Regards, Martin From solipsis at pitrou.net Sun Aug 28 22:01:06 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 28 Aug 2011 22:01:06 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5A9B39.8090009@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> Message-ID: <1314561666.3656.3.camel@localhost.localdomain> > - the iobench results are between 2% acceleration (seek operations), > 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and > 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed > difference is probably in the UTF-8 decoder; I have already > restored the "runs of ASCII" optimization and am out of ideas for > further speedups. Again, having to scan the UTF-8 string twice > is probably one cause of slowdown. I don't think it's the UTF-8 decoder because I see an even larger slowdown with simpler encodings (e.g. "-E latin1" or "-E utf-16le"). Thanks Antoine. 
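(To make the "scanned twice" remark concrete: the allocation logic has roughly the following shape. This is only a pure-Python sketch over already-decoded code points, with made-up names; the real implementation is C code in Objects/unicodeobject.c and works on the raw byte buffer.)

    import array

    def build_compact_string(code_points):
        maxchar = 0
        for cp in code_points:        # pass 1: inspect every character
            if cp > maxchar:
                maxchar = cp
        if maxchar < 0x100:
            buf = array.array('B')    # 1 byte per character (latin-1 range)
        elif maxchar < 0x10000:
            buf = array.array('H')    # 2 bytes per character (BMP)
        else:
            buf = array.array('I')    # 4 bytes per character (full range)
        buf.extend(code_points)       # pass 2: copy every character again
        return maxchar, buf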
From martin at v.loewis.de Sun Aug 28 22:23:42 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 28 Aug 2011 22:23:42 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <1314561666.3656.3.camel@localhost.localdomain> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> <1314561666.3656.3.camel@localhost.localdomain> Message-ID: <4E5AA3CE.50503@v.loewis.de> Am 28.08.2011 22:01, schrieb Antoine Pitrou: > >> - the iobench results are between 2% acceleration (seek operations), >> 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and >> 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed >> difference is probably in the UTF-8 decoder; I have already >> restored the "runs of ASCII" optimization and am out of ideas for >> further speedups. Again, having to scan the UTF-8 string twice >> is probably one cause of slowdown. > > I don't think it's the UTF-8 decoder because I see an even larger > slowdown with simpler encodings (e.g. "-E latin1" or "-E utf-16le"). But those aren't used in iobench, are they? Regards, Martin From solipsis at pitrou.net Sun Aug 28 22:27:20 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 28 Aug 2011 22:27:20 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5AA3CE.50503@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> <1314561666.3656.3.camel@localhost.localdomain> <4E5AA3CE.50503@v.loewis.de> Message-ID: <1314563240.3656.6.camel@localhost.localdomain> Le dimanche 28 ao?t 2011 ? 22:23 +0200, "Martin v. L?wis" a ?crit : > Am 28.08.2011 22:01, schrieb Antoine Pitrou: > > > >> - the iobench results are between 2% acceleration (seek operations), > >> 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and > >> 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed > >> difference is probably in the UTF-8 decoder; I have already > >> restored the "runs of ASCII" optimization and am out of ideas for > >> further speedups. Again, having to scan the UTF-8 string twice > >> is probably one cause of slowdown. > > > > I don't think it's the UTF-8 decoder because I see an even larger > > slowdown with simpler encodings (e.g. "-E latin1" or "-E utf-16le"). > > But those aren't used in iobench, are they? I was not very clear, but you can change the encoding used in iobench by using the "-E" command-line option (while UTF-8 is the default if you don't specify anything). For example: $ ./python Tools/iobench/iobench.py -t -E latin1 Preparing files... Text unit = one character (latin1-decoded) ** Text input ** [ 400KB ] read one unit at a time... 5.17 MB/s [ 400KB ] read 20 units at a time... 77.6 MB/s [ 400KB ] read one line at a time... 209 MB/s [ 400KB ] read 4096 units at a time... 509 MB/s [ 20KB ] read whole contents at once... 885 MB/s [ 400KB ] read whole contents at once... 730 MB/s [ 10MB ] read whole contents at once... 726 MB/s (etc.) Regards Antoine. 
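(For what it's worth, a quick way to take the I/O stack out of the picture and time a single codec by itself -- payload size, contents and encoding below are arbitrary:)

    import timeit
    # the \xe9 forces the non-ASCII path of the latin-1 codec
    setup = "data = ('abcdefgh\xe9' * 100000).encode('latin-1')"
    print(timeit.timeit("data.decode('latin-1')", setup=setup, number=100))

Comparing the same payload decoded as latin-1, utf-8 and utf-16-le on both builds would show whether the regression really is codec-specific.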
From martin at v.loewis.de Sun Aug 28 23:06:34 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 28 Aug 2011 23:06:34 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <1314561666.3656.3.camel@localhost.localdomain> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> <1314561666.3656.3.camel@localhost.localdomain> Message-ID: <4E5AADDA.5090206@v.loewis.de> Am 28.08.2011 22:01, schrieb Antoine Pitrou: > >> - the iobench results are between 2% acceleration (seek operations), >> 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and >> 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed >> difference is probably in the UTF-8 decoder; I have already >> restored the "runs of ASCII" optimization and am out of ideas for >> further speedups. Again, having to scan the UTF-8 string twice >> is probably one cause of slowdown. > > I don't think it's the UTF-8 decoder because I see an even larger > slowdown with simpler encodings (e.g. "-E latin1" or "-E utf-16le"). Those haven't been ported to the new API, yet. Consider, for example, d9821affc9ee. Before that, I got 253 MB/s on the 4096 units read test; with that change, I get 610 MB/s. The trunk gives me 488 MB/s, so this is a 25% speedup for PEP 393. Regards, Martin From greg.ewing at canterbury.ac.nz Mon Aug 29 00:24:04 2011 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Mon, 29 Aug 2011 10:24:04 +1200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> Message-ID: <4E5AC004.8030103@canterbury.ac.nz> Guido van Rossum wrote: > On Sat, Aug 27, 2011 at 3:14 PM, Dan Stromberg wrote: > >>IMO, we really, really need some common way of accessing C libraries that >>works for all major Python variants. > > We have one. It's called writing an extension module. I think Dan means some way of doing this without having to hand-craft a different one for each Python implementation. If we're really serious about the idea that "Python is not CPython", this seems like a reasonable thing to want. Currently the Python universe is very much centred around CPython, with the other implementations perpetually in catch-up mode. My suggestion on how to address this would be something akin to Pyrex or Cython. I gather that there has been some work recently on adding different back-ends to Cython to generate code for different Python implementations. -- Greg From stephen at xemacs.org Mon Aug 29 04:20:12 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Mon, 29 Aug 2011 11:20:12 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87liudklgj.fsf@uwakimon.sk.tsukuba.ac.jp> Paul Moore writes: > IronPython and Jython can retain UTF-16 as their native form if that > makes interop cleaner, but in doing so they need to ensure that basic > operations like indexing and len work in terms of code points, not > code units, if they are to conform. [...] 
> They lose the O(1) guarantee, but that's easily defensible as a > tradeoff to conform to underlying runtime semantics. Unfortunately, I don't think it's all that easy to defend. Absent PEP 393 or a restriction to the characters in the BMP, this is a very expensive change, easily visible to interactive users, let alone performance-hungry applications. I personally do advocate the "array of code points" definition, but I don't use IronPython or Jython so PEP 393 is as close to heaven as I expect to get. OTOH, I also use Emacsen with Mule, and I have to admit that there is a perceptible performance hit in any large (>1 MB) buffer containing non-ASCII characters vs. pure ASCII (the code unit in Mule is 1 byte). I expect that if IronPython and Jython really want to retain native, code-unit-based representations, it's going to be painful to conform to an "array of code points" specification. There may need to be a compromise of the form "Implementations SHOULD provide an implementation of str that is both O(1) in indexing and an array of code points. Code that is Unicode-ly correct in Python implementing PEP 393 will need to be ported with some effort to implementations that do not satisfy this requirement, perhaps using different algorithms or extra libraries." From guido at python.org Mon Aug 29 04:27:16 2011 From: guido at python.org (Guido van Rossum) Date: Sun, 28 Aug 2011 19:27:16 -0700 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: On Sun, Aug 28, 2011 at 11:23 AM, Stefan Behnel wrote: > Hi, > > sorry for hooking in here with my usual Cython bias and promotion. When the > question comes up what a good FFI for Python should look like, it's an > obvious reaction from my part to throw Cython into the game. > > Terry Reedy, 28.08.2011 06:58: >> >> Dan, I once had the more or less the same opinion/question as you with >> regard to ctypes, but I now see at least 3 problems. >> >> 1) It seems hard to write it correctly. There are currently 47 open ctypes >> issues, with 9 being feature requests, leaving 38 behavior-related issues. >> Tom Heller has not been able to work on it since the beginning of 2010 and >> has formally withdrawn as maintainer. No one else that I know of has taken >> his place. > > Cython has an active set of developers and a rather large and growing user > base. > > It certainly has lots of open issues in its bug tracker, but most of them > are there because we *know* where the development needs to go, not so much > because we don't know how to get there. After all, the semantics of Python > and C/C++, between which Cython sits, are pretty much established. > > Cython compiles to C code for CPython, (hopefully soon [1]) to Python+ctypes > for PyPy and (mostly [2]) C++/CLI code for IronPython, which boils down to > the same build time and runtime kind of dependencies that the supported > Python runtimes have anyway. It does not add dependencies on any external > libraries by itself, such as the libffi in CPython's ctypes implementation. > > For the CPython backend, the generated code is very portable and is > self-contained when compiled against the CPython runtime (plus, obviously, > libraries that the user code explicitly uses). 
It generates efficient code > for all existing CPython versions starting with Python 2.4, with several > optimisations also for recent CPython versions (including the upcoming 3.3). > > >> 2) It is not trivial to use it correctly. > > Cython is basically Python, so Python developers with some C or C++ > knowledge tend to get along with it quickly. > > I can't say yet how easy it is (or will be) to write code that is portable > across independent Python implementations, but given that that field is > still young, there's certainly a lot that can be done to aid this. Cythin does sound attractive for cross-Python-implementation use. This is exciting. >> I think it needs a SWIG-like >> companion script that can write at least first-pass ctypes code from the .h >> header files. Or maybe it could/should use header info at runtime (with the >> .h bundled with a module). > > From my experience, this is a "nice to have" more than a requirement. It has > been requested for Cython a couple of times, especially by new users, and > there are a couple of scripts out there that do this to some extent. But the > usual problem is that Cython users (and, similarly, ctypes users) do not > want a 1:1 mapping of a library API to a Python API (there's SWIG for that), > and you can't easily get more than a trivial mapping out of a script. But, > yes, a one-shot generator for the necessary declarations would at least help > in cases where the API to be wrapped is somewhat large. Hm, the main use that was proposed here for ctypes is to wrap existing libraries (not to create nicer APIs, that can be done in pure Python on top of this). In general, an existing library cannot be called without access to its .h files -- there are probably struct and constant definitions, platform-specific #ifdefs and #defines, and other things in there that affect the linker-level calling conventions for the functions in the library. (Just like Python's own .h files -- e.g. the extensive renaming of the Unicode APIs depending on narrow/wide build) How does Cython deal with these? I wonder if for this particular purpose SWIG isn't the better match. (If SWIG weren't universally hated, even by its original author. :-) >> 3) It seems to be slower than compiled C extension wrappers. That, at >> least, was the discovery of someone who re-wrote pygame using ctypes. (The >> hope was that using ctypes would aid porting to 3.x, but the time penalty >> was apparently too much for time-critical code.) > > Cython code can be as fast as C code, and in some cases, especially when > developer time is limited, even faster than hand written C extensions. It > allows for a straight forward optimisation path from regular Python code > down to the speed of C, and trivial interaction with C code itself, if the > need arises. > > Stefan > > > [1] The PyPy port of Cython is currently being written as a GSoC project. > > [2] The IronPython port of Cython was written to facility a NumPy port to > the .NET environment. It's currently not a complete port of all Cython > features. -- --Guido van Rossum (python.org/~guido) From stephen at xemacs.org Mon Aug 29 04:48:43 2011 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Mon, 29 Aug 2011 11:48:43 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> Message-ID: <87k49xkk50.fsf@uwakimon.sk.tsukuba.ac.jp> Guido van Rossum writes: > I don't think anyone else has that impression. Please cite chapter and > verse if you really think this is important. IIUC, UCS-2 does not > allow surrogate pairs, In the original definition of UCS-2 in draft ISO 10646 (1990), everything in the BMP except for 0xFFFF and 0xFFFE was a character, and there was no concept of "surrogate" at all. Later in ISO 10646 (1993)[1], the Surrogate Area was carved out of the Private Area, but UCS-2 implementations simply treat them as (single) characters with special properties. This was more or less backward compatible as all corporate uses of the private area used the lower code points and didn't conflict with the surrogates. Finally (in 2000 or 2003) the definition of UCS-2 in ISO 10646 was revised in a backward- incompatible way to exclude surrogates entirely, ie, nowadays it is a range-restricted version of UTF-16. Footnotes: [1] IIRC, strictly speaking this was done slightly later (1993 or 1994) in an official Amendment to ISO 10646; the Amendment was incorporated into the standard in 2000. From ncoghlan at gmail.com Mon Aug 29 04:59:41 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 29 Aug 2011 12:59:41 +1000 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: On Mon, Aug 29, 2011 at 12:27 PM, Guido van Rossum wrote: > I wonder if for > this particular purpose SWIG isn't the better match. (If SWIG weren't > universally hated, even by its original author. :-) SWIG is nice when you control the C/C++ side of the API as well and can tweak it to be SWIG-friendly. I shudder at the idea of using it to wrap arbitrary C++ code, though. That said, the idea of using SWIG to emit Cython code rather than C/API code may be one well worth exploring. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From stephen at xemacs.org Mon Aug 29 05:43:24 2011 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Mon, 29 Aug 2011 12:43:24 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> References: <20110823001440.433a0f1f@pitrou.net> <4E536B0C.8050008@v.loewis.de> <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> Message-ID: <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> Raymond Hettinger writes: > The naming convention for codecs is that the UTF prefix is used for > lossless encodings that cover the entire range of Unicode. Sure. The operative word here is "codec", not "str", though. > "The first amendment to the original edition of the UCS defined > UTF-16, an extension of UCS-2, to represent code points outside the > BMP." Since when can s[0] represent a code point outside the BMP, for s a Unicode string in a narrow build? Remember, the UCS-2/narrow vs. UCS-4/wide distinction is *not* about what Python supports vs. the outside world. It's about what the str/ unicode type is an array of. From glyph at twistedmatrix.com Mon Aug 29 06:46:35 2011 From: glyph at twistedmatrix.com (Glyph Lefkowitz) Date: Sun, 28 Aug 2011 21:46:35 -0700 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: <9FA8683B-FB0A-4F46-878F-11B36F92A342@twistedmatrix.com> On Aug 28, 2011, at 7:27 PM, Guido van Rossum wrote: > In general, an existing library cannot be called > without access to its .h files -- there are probably struct and > constant definitions, platform-specific #ifdefs and #defines, and > other things in there that affect the linker-level calling conventions > for the functions in the library. Unfortunately I don't know a lot about this, but I keep hearing about something called "rffi" that PyPy uses to call C from RPython: . This has some shortcomings currently, most notably the fact that it needs those .h files (and therefore a C compiler) at runtime, so it's currently a non-starter for code distributed to users. Not to mention the fact that, as you can see, it's not terribly thoroughly documented. But, that "ExternalCompilationInfo" object looks very promising, since it has fields like "includes", "libraries", etc. Nevertheless it seems like it's a bit more type-safe than ctypes or cython, and it seems to me that it could cache some of that information that it extracts from header files and store it for later when a compiler might not be around. Perhaps someone with more PyPy knowledge than I could explain whether this is a realistic contender for other Python runtimes? -------------- next part -------------- An HTML attachment was scrubbed... URL: From mal at egenix.com Mon Aug 29 10:00:56 2011 From: mal at egenix.com (M.-A. 
Lemburg) Date: Mon, 29 Aug 2011 10:00:56 +0200 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: <4E5B4738.30008@egenix.com> Guido van Rossum wrote: > On Sun, Aug 28, 2011 at 11:23 AM, Stefan Behnel wrote: >> Hi, >> >> sorry for hooking in here with my usual Cython bias and promotion. When the >> question comes up what a good FFI for Python should look like, it's an >> obvious reaction from my part to throw Cython into the game. >> >> Terry Reedy, 28.08.2011 06:58: >>> >>> Dan, I once had the more or less the same opinion/question as you with >>> regard to ctypes, but I now see at least 3 problems. >>> >>> 1) It seems hard to write it correctly. There are currently 47 open ctypes >>> issues, with 9 being feature requests, leaving 38 behavior-related issues. >>> Tom Heller has not been able to work on it since the beginning of 2010 and >>> has formally withdrawn as maintainer. No one else that I know of has taken >>> his place. >> >> Cython has an active set of developers and a rather large and growing user >> base. >> >> It certainly has lots of open issues in its bug tracker, but most of them >> are there because we *know* where the development needs to go, not so much >> because we don't know how to get there. After all, the semantics of Python >> and C/C++, between which Cython sits, are pretty much established. >> >> Cython compiles to C code for CPython, (hopefully soon [1]) to Python+ctypes >> for PyPy and (mostly [2]) C++/CLI code for IronPython, which boils down to >> the same build time and runtime kind of dependencies that the supported >> Python runtimes have anyway. It does not add dependencies on any external >> libraries by itself, such as the libffi in CPython's ctypes implementation. >> >> For the CPython backend, the generated code is very portable and is >> self-contained when compiled against the CPython runtime (plus, obviously, >> libraries that the user code explicitly uses). It generates efficient code >> for all existing CPython versions starting with Python 2.4, with several >> optimisations also for recent CPython versions (including the upcoming 3.3). >> >> >>> 2) It is not trivial to use it correctly. >> >> Cython is basically Python, so Python developers with some C or C++ >> knowledge tend to get along with it quickly. >> >> I can't say yet how easy it is (or will be) to write code that is portable >> across independent Python implementations, but given that that field is >> still young, there's certainly a lot that can be done to aid this. > > Cythin does sound attractive for cross-Python-implementation use. This > is exciting. > >>> I think it needs a SWIG-like >>> companion script that can write at least first-pass ctypes code from the .h >>> header files. Or maybe it could/should use header info at runtime (with the >>> .h bundled with a module). >> >> From my experience, this is a "nice to have" more than a requirement. It has >> been requested for Cython a couple of times, especially by new users, and >> there are a couple of scripts out there that do this to some extent. But the >> usual problem is that Cython users (and, similarly, ctypes users) do not >> want a 1:1 mapping of a library API to a Python API (there's SWIG for that), >> and you can't easily get more than a trivial mapping out of a script. 
But, >> yes, a one-shot generator for the necessary declarations would at least help >> in cases where the API to be wrapped is somewhat large. > > Hm, the main use that was proposed here for ctypes is to wrap existing > libraries (not to create nicer APIs, that can be done in pure Python > on top of this). In general, an existing library cannot be called > without access to its .h files -- there are probably struct and > constant definitions, platform-specific #ifdefs and #defines, and > other things in there that affect the linker-level calling conventions > for the functions in the library. (Just like Python's own .h files -- > e.g. the extensive renaming of the Unicode APIs depending on > narrow/wide build) How does Cython deal with these? I wonder if for > this particular purpose SWIG isn't the better match. (If SWIG weren't > universally hated, even by its original author. :-) SIP is an alternative to SWIG: http://www.riverbankcomputing.com/software/sip/intro http://pypi.python.org/pypi/SIP and there are a few others as well: http://wiki.python.org/moin/IntegratingPythonWithOtherLanguages >>> 3) It seems to be slower than compiled C extension wrappers. That, at >>> least, was the discovery of someone who re-wrote pygame using ctypes. (The >>> hope was that using ctypes would aid porting to 3.x, but the time penalty >>> was apparently too much for time-critical code.) >> >> Cython code can be as fast as C code, and in some cases, especially when >> developer time is limited, even faster than hand written C extensions. It >> allows for a straight forward optimisation path from regular Python code >> down to the speed of C, and trivial interaction with C code itself, if the >> need arises. >> >> Stefan >> >> >> [1] The PyPy port of Cython is currently being written as a GSoC project. >> >> [2] The IronPython port of Cython was written to facility a NumPy port to >> the .NET environment. It's currently not a complete port of all Cython >> features. > -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 29 2011) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2011-10-04: PyCon DE 2011, Leipzig, Germany 36 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From dirkjan at ochtman.nl Mon Aug 29 11:03:04 2011 From: dirkjan at ochtman.nl (Dirkjan Ochtman) Date: Mon, 29 Aug 2011 11:03:04 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5A9B39.8090009@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> Message-ID: On Sun, Aug 28, 2011 at 21:47, "Martin v. L?wis" wrote: > ?result strings. In PEP 393, a buffer must be scanned for the > ?highest code point, which means that each byte must be inspected > ?twice (a second time when the copying occurs). This may be a silly question: are there things in place to optimize this for the case where two strings are combined? E.g. 
highest character in combined string is max(highest character in either of the strings). Also, this PEP makes me wonder if there should be a way to distinguish between language PEPs and (CPython) implementation PEPs, by adding a tag or using the PEP number ranges somehow. Cheers, Dirkjan From victor.stinner at haypocalc.com Mon Aug 29 11:19:48 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Mon, 29 Aug 2011 11:19:48 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> Message-ID: <4E5B59B4.9010207@haypocalc.com> Le 29/08/2011 11:03, Dirkjan Ochtman a ?crit : > On Sun, Aug 28, 2011 at 21:47, "Martin v. L?wis" wrote: >> result strings. In PEP 393, a buffer must be scanned for the >> highest code point, which means that each byte must be inspected >> twice (a second time when the copying occurs). > > This may be a silly question: are there things in place to optimize > this for the case where two strings are combined? E.g. highest > character in combined string is max(highest character in either of the > strings). The "double-scan" issue is only for codec decoders. If you combine two Unicode objects (a+b), you already know the highest code point and the kind of each string. Victor From victor.stinner at haypocalc.com Mon Aug 29 10:52:52 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Mon, 29 Aug 2011 10:52:52 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5AADDA.5090206@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> <1314561666.3656.3.camel@localhost.localdomain> <4E5AADDA.5090206@v.loewis.de> Message-ID: <4E5B5364.9040100@haypocalc.com> Le 28/08/2011 23:06, "Martin v. L?wis" a ?crit : > Am 28.08.2011 22:01, schrieb Antoine Pitrou: >> >>> - the iobench results are between 2% acceleration (seek operations), >>> 16% slowdown for small-sized reads (4.31MB/s vs. 5.22 MB/s) and >>> 37% for large sized reads (154 MB/s vs. 235 MB/s). The speed >>> difference is probably in the UTF-8 decoder; I have already >>> restored the "runs of ASCII" optimization and am out of ideas for >>> further speedups. Again, having to scan the UTF-8 string twice >>> is probably one cause of slowdown. >> >> I don't think it's the UTF-8 decoder because I see an even larger >> slowdown with simpler encodings (e.g. "-E latin1" or "-E utf-16le"). > > Those haven't been ported to the new API, yet. Consider, for example, > d9821affc9ee. Before that, I got 253 MB/s on the 4096 units read test; > with that change, I get 610 MB/s. The trunk gives me 488 MB/s, so this > is a 25% speedup for PEP 393. If I understand correctly, the performance now highly depend on the used characters? A pure ASCII string is faster than a string with characters in the ISO-8859-1 charset? Is it also true for BMP characters vs non-BMP characters? Do these benchmark tools use only ASCII characters, or also some ISO-8859-1 characters? Or, better, different Unicode ranges in different tests? 
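(For concreteness, the kind of test strings I have in mind could be generated along these lines -- a throwaway sketch with arbitrary sizes, assuming a wide or PEP 393 build so that chr() yields a single non-BMP character:)

    def sample(maxchar, length=100000):
        # build a test string whose highest code point is exactly maxchar
        return ('a' * 99 + chr(maxchar)) * (length // 100)

    for maxchar in (0x7f, 0xe9, 0x20ac, 0x10400):   # ASCII, latin-1, BMP, non-BMP
        print(hex(maxchar), len(sample(maxchar)))

Feeding strings like these to the benchmarks would answer the range question directly.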
Victor From arigo at tunes.org Mon Aug 29 11:36:27 2011 From: arigo at tunes.org (Armin Rigo) Date: Mon, 29 Aug 2011 11:36:27 +0200 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: References: Message-ID: Hi Guido, On Sun, Aug 28, 2011 at 6:43 PM, Guido van Rossum wrote: > This sounds like a very interesting idea to pursue, even if it's late, > and even if it's experimental, and even if it's possible to cause > deadlocks (no news there). I propose that we offer a C API in Python > 3.3 as well as an extension module that offers the proposed decorator. Very good idea. http://bugs.python.org/issue12850 The extension module, called 'stm' for now, is designed as an independent 3rd-party extension module. It should at this point not be included in the stdlib; for one thing, it needs some more testing than my quick one-page hacks, and we need to seriously look at the deadlock issues mentioned here. But the patch to ceval.c above looks rather straightforward to me and could, if no subtle issue is found, be included in the standard CPython. A bient?t, Armin. From stefan_ml at behnel.de Mon Aug 29 11:39:12 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 29 Aug 2011 11:39:12 +0200 Subject: [Python-Dev] Ctypes and the stdlib In-Reply-To: References: <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: Guido van Rossum, 29.08.2011 04:27: > On Sun, Aug 28, 2011 at 11:23 AM, Stefan Behnel wrote: >> Terry Reedy, 28.08.2011 06:58: >>> I think it needs a SWIG-like >>> companion script that can write at least first-pass ctypes code from the .h >>> header files. Or maybe it could/should use header info at runtime (with the >>> .h bundled with a module). >> >> From my experience, this is a "nice to have" more than a requirement. It has >> been requested for Cython a couple of times, especially by new users, and >> there are a couple of scripts out there that do this to some extent. But the >> usual problem is that Cython users (and, similarly, ctypes users) do not >> want a 1:1 mapping of a library API to a Python API (there's SWIG for that), >> and you can't easily get more than a trivial mapping out of a script. But, >> yes, a one-shot generator for the necessary declarations would at least help >> in cases where the API to be wrapped is somewhat large. > > Hm, the main use that was proposed here for ctypes is to wrap existing > libraries (not to create nicer APIs, that can be done in pure Python > on top of this). The same applies to Cython, obviously. The main advantage of Cython over ctypes for this is that the Python-level wrapper code is also compiled into C, so whenever the need for a thicker wrapper arises in some part of the API, you don't loose any performance in intermediate layers. > In general, an existing library cannot be called > without access to its .h files -- there are probably struct and > constant definitions, platform-specific #ifdefs and #defines, and > other things in there that affect the linker-level calling conventions > for the functions in the library. (Just like Python's own .h files -- > e.g. the extensive renaming of the Unicode APIs depending on > narrow/wide build) How does Cython deal with these? In the CPython backend, the header files are normally #included by the generated C code, so they are used at C compilation time. Cython has its own view on the header files in separate declaration files (.pxd). 
Basically looks like this:

    # file "mymath.pxd"
    cdef extern from "aheader.h":
        double PI
        double E
        double abs(double x)

These declaration files usually only contain the parts of a header file that are used in the user code, either manually copied over or extracted by scripts (that's what I was referring to in my reply to Terry). The complete 'real' content of the header file is then used by the C compiler at C compilation time. The user code employs a "cimport" statement to import the declarations at Cython compilation time, e.g.

    # file "mymodule.pyx"
    cimport mymath
    print mymath.PI + mymath.E

would result in C code that #includes "aheader.h", adds the C constants "PI" and "E", converts the result to a Python float object and prints it out using the normal CPython machinery. This means that declarations can be reused across modules, just like with header files. In fact, Cython actually ships with a couple of common declaration files, e.g. for parts of libc, NumPy or CPython's C-API. I don't know that much about the IronPython backend, but from what I heard, it uses basically the same build time mechanisms and generates a thin C++ wrapper and a corresponding CLI part as glue layer. The ctypes backend for PyPy works different in that it generates a Python module from the .pxd files that contains the declarations as ctypes code. Then, the user code imports that normally at Python runtime. Obviously, this means that there are cases where the Cython-level declarations and thus the generated ctypes code will not match the ABI for a given target platform. So, in the worst case, there is a need to manually adapt the ctypes declarations in the Python module that was generated from the .pxd. Not worse than the current situation, though, and the rest of the Cython wrapper will compile into plain Python code that simply imports the declarations from the .pxd modules. But there's certainly room for improvements here. Stefan From p.f.moore at gmail.com Mon Aug 29 12:37:23 2011 From: p.f.moore at gmail.com (Paul Moore) Date: Mon, 29 Aug 2011 11:37:23 +0100 Subject: [Python-Dev] Ctypes and the stdlib In-Reply-To: References: <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: On 29 August 2011 10:39, Stefan Behnel wrote: > In the CPython backend, the header files are normally #included by the > generated C code, so they are used at C compilation time. > > Cython has its own view on the header files in separate declaration files > (.pxd). Basically looks like this:
>
>     # file "mymath.pxd"
>     cdef extern from "aheader.h":
>         double PI
>         double E
>         double abs(double x)
>
> These declaration files usually only contain the parts of a header file that > are used in the user code, either manually copied over or extracted by > scripts (that's what I was referring to in my reply to Terry). The complete > 'real' content of the header file is then used by the C compiler at C > compilation time. > > The user code employs a "cimport" statement to import the declarations at > Cython compilation time, e.g.
>
>     # file "mymodule.pyx"
>     cimport mymath
>     print mymath.PI + mymath.E
>
> would result in C code that #includes "aheader.h", adds the C constants "PI" > and "E", converts the result to a Python float object and prints it out > using the normal CPython machinery.
One thing that would make it easier for me to understand the role of Cython in this context would be to see a simple example of the type of "thin wrapper" we're talking about here. The above code is nearly this, but the pyx file executes "real code". For example, how do I simply expose pi and abs from math.h? Based on the above, I tried a pyx file containing just the code cdef extern from "math.h": double pi double abs(double x) but the resulting module exported no symbols. What am I doing wrong? Could you show a working example of writing such a wrapper? This is probably a bit off-topic, but it seems to me that whenever Cython comes up in these discussions, the implications of Cython-as-an-implementation-of-python obscure the idea of simply using Cython as a means of writing thin library wrappers. Just to clarify - the above code (if it works) seems to me like a nice simple means of writing wrappers. Something involving this in a pxd file, plus a pyx file with a whole load of dummy def abs(x): return cimported_module.abs(x) definitions, seems ok, but annoyingly clumsy. (Particularly for big APIs). I've kept python-dev in this response, on the assumption that others on the list might be glad of seeing a concrete example of using Cython to build wrapper code. But anything further should probably be taken off-list... Thanks, Paul. PS This would also probably be a useful addition to the Cython wiki and/or the manual. I searched both and found very little other than a page on wrapping C++ classes (which is not very helpful for simple C global functions and constants). From ezio.melotti at gmail.com Mon Aug 29 13:12:20 2011 From: ezio.melotti at gmail.com (Ezio Melotti) Date: Mon, 29 Aug 2011 14:12:20 +0300 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <4E582432.2080301@v.loewis.de> <20110827035602.557f772f@pitrou.net> Message-ID: On Sun, Aug 28, 2011 at 7:28 AM, Guido van Rossum wrote: > > Are you volunteering? (Even if you don't want to be the only > maintainer, it still sounds like you'd be a good co-maintainer of the > regex module.) > My name is listed in the experts index for 're' [0], and that should make me already "co-maintainer" for the module. > [...] > > > 4) add documentation for the module and the (public) functions in > > Doc/library (this should be done anyway). > > Does regex have a significany public C interface? (_sre.c doesn't.) > Does it have a Python-level interface beyond what re.py offers (apart > from the obvious new flags and new regex syntax/semantics)? > I don't think it does. Explaining the new syntax/semantics is useful for developers (e.g.what \p and \X are supposed to match), but also for users, so it's fine to have this documented in Doc/library/re.rst (and I don't think it's necessary to duplicate it in the README/PEP/Wiki). > > > This will ensure that the general quality of the code is good, and when > > someone actually has to work on the code, there's enough documentation to > > make it possible. > > That sounds like a good description of a process that could lead to > acceptance of regex as a re replacement. > > So if we want to get this done I think we need Matthew for 1) (unless someone else wants to do it and have him review the result). If making a diff with the current re is doable and makes sense, we can use the rietveld instance on the bug tracker to make the review for 2). The same could be done with a diff that replaces the whole module though. 
3) will follow after 2), and 4) is not difficult and can be done when we actually replace re (it's probably enough to reorganize a bit and convert to rst the page on PyPI). Best Regards, Ezio Melotti [0]: http://docs.python.org/devguide/experts.html#stdlib -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Mon Aug 29 14:14:40 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 29 Aug 2011 14:14:40 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project References: <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <20110829141440.2e2178c6@pitrou.net> On Mon, 29 Aug 2011 12:43:24 +0900 "Stephen J. Turnbull" wrote: > > Since when can s[0] represent a code point outside the BMP, for s a > Unicode string in a narrow build? > > Remember, the UCS-2/narrow vs. UCS-4/wide distinction is *not* about > what Python supports vs. the outside world. It's about what the str/ > unicode type is an array of. Why would that be? Antoine. From solipsis at pitrou.net Mon Aug 29 14:20:15 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 29 Aug 2011 14:20:15 +0200 Subject: [Python-Dev] Software Transactional Memory for Python References: Message-ID: <20110829142015.5eb247dc@pitrou.net> On Sun, 28 Aug 2011 09:43:33 -0700 Guido van Rossum wrote: > > This sounds like a very interesting idea to pursue, even if it's late, > and even if it's experimental, and even if it's possible to cause > deadlocks (no news there). I propose that we offer a C API in Python > 3.3 as well as an extension module that offers the proposed decorator. > The C API could then be used to implement alternative APIs purely as > extension modules (e.g. would a deadlock-detecting API be possible?). We could offer the C API without shipping an extension module ourselves. I don't think we should provide (and maintain!) a Python API that helps users put themselves in all kind of nasty situations. There is enough misunderstanding around the GIL and multithreading already. Regards Antoine. From barry at python.org Mon Aug 29 14:30:29 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 29 Aug 2011 08:30:29 -0400 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> Message-ID: <20110829083029.68faa57b@resist.wooz.org> On Aug 27, 2011, at 10:36 PM, Nadeem Vawda wrote: >I talked to Antoine about this on IRC; he didn't seem to think a PEP would be >necessary. But a summary of the discussion on the tracker issue might still >be a useful thing to have, given how long it's gotten. I agree with Antoine - no PEP should be necessary. A well reviewed and tested module should do it. -Barry From dave at dabeaz.com Mon Aug 29 14:41:23 2011 From: dave at dabeaz.com (David Beazley) Date: Mon, 29 Aug 2011 07:41:23 -0500 Subject: [Python-Dev] SWIG (was Re: Ctypes and the stdlib) In-Reply-To: References: Message-ID: On Mon, Aug 29, 2011 at 12:27 PM, Guido van Rossum wrote: > I wonder if for > this particular purpose SWIG isn't the better match. 
(If SWIG weren't > universally hated, even by its original author. :-) Hate is probably a strong word, but as the author of Swig, let me chime in here ;-). I think there are probably some lessons to be learned from Swig. As Nick noted, Swig is best suited when you have control over both sides (C/C++ and Python) of whatever code you're working with. In fact, the original motivation for Swig was to give application programmers (scientists in my case), a means for automatically generating the Python bindings to their code. However, there was one other important assumption--and that was the fact that all of your "real code" was going to be written in C/C++ and that the Python scripting interface was just an optional add-on (perhaps even just a throw-away thing). Keep in mind, Swig was first created in 1995 and at that time, the use of Python (or any similar language) was a pretty radical idea in the sciences. Moreover, there was a lot of legacy code that people just weren't going to abandon. Thus, I always viewed Swig as a kind of transitional vehicle for getting people to use Python who might otherwise not even consider it. Getting back to Nick's point though, to really use Swig effectively, it was always known that you might have to reorganize or refactor your C/C++ code to make it more Python friendly. However, due to the automatic wrapper generation, you didn't have to do it all at once. Basically your code could organically evolve and Swig would just keep up with whatever you were doing. In my projects, we'd usually just tuck Swig away in some Makefile somewhere and forget about it. One of the major complexities of Swig is the fact that it attempts to parse C/C++ header files. This very notion is actually a dangerous trap waiting for anyone who wants to wander into it. You might look at a header file and say, well how hard could it be to just grab a few definitions out of there? I'll just write a few regexs or come up with some simple hack for recognizing function definitions or something. Yes, you can do that, but you're immediately going to find that whatever approach you take starts to break down into horrible corner cases. Swig started out like this and quickly turned into a quagmire of esoteric bug reports. All sorts of problems with preprocessor macros, typedefs, missing headers, and other things. For awhile, I would get these bug reports that would go something like "I had this C++ class inside a namespace with an abstract method taking a typedef'd const reference to this smart pointer ..... and Swig broke." Hell, I can't even understand the bug report let alone know how to fix it. Almost all of these bugs were due to the fact that Swig started out as a hack and didn't really have any kind of solid conceptual foundation for how it should be put together. If you flash forward a bit, from about 2001-2004 there was a very serious push to fix these kinds of issues. Although it was not a complete rewrite of Swig, there were a huge number of changes to how it worked during this time. Swig grew a fully compatible C++ preprocessor that fully supported macros A complete C++ type system was implemented including support for namespaces, templates, and even such things as template partial specialization. Swig evolved into a multi-pass compiler that was doing all sorts of global analysis of the interface. Just to give you an idea, Swig would do things such as automatically detect/wrap C++ smart pointers. It could wrap overloaded C++ methods/function. 
Also, if you had a C++ class with virtual methods, it would only make one Python wrapper function and then reuse across all wrapped subclasses. Under the covers of all of this, the implementation basically evolved into a sophisticated macro preprocessor coupled with a pattern matching engine built on top of the C++ type system. For example, you could write patterns that matched specific C++ types (the much hated "typemap" feature) and you could write patterns that matched entire C++ declarations. This whole pattern matching approach had a huge power if you knew what you were doing. For example, I had a graduate student working on adding "contracts" to Swig--something that was being funded by a NSF grant. It was cool and mind boggling all at once. In hindsight however, I think the complexity of Swig has exceeded anyone's ability to fully understand it (including my own). For example, to even make sense of what's happening, you have to have a pretty solid grasp of the C/C++ type system (easier said than done). Couple that with all sorts of crazy pattern matching, low-level code fragments, and a ton of macro definitions, your head will literally explode if you try to figure out what's happening. So far as I know, recent versions of Swig have even combined all of this type-pattern matching with regular expressions. I can't even fathom it. Sadly, my involvement was Swig was an unfortunate casualty of my academic career biting the dust. By 2005, I was so burned out of working on it and so sick of what I was doing, I quite literally put all of my computer stuff aside to go play in a band for a few years. After a few years, I came back to programming (obviously), but not to keep working on the same stuff. In particularly, I will die quite happy if I never have to look at another line of C++ code ever again. No, I would much rather fling my toddlers, ride my bike, play piano, or do just about anything than ever do that again. Although I still subscribe the Swig mailing lists and watch what's happening, I'm not active with it at the moment. I've sometimes thought it might be interesting to create a Swig replacement purely in Python. When I work on the PLY project, this is often what I think about. In that project, I've actually built a number of the parsing tools that would be useful in creating such a thing. The only catch is that when I start thinking along these lines, I usually reach a point where I say "nah, I'll just write the whole application in Python." Anyways, this is probably way more than anyone wants to know about Swig. Getting back to the original topic of using it to make standard library modules, I just don't know. I think you probably could have some success with an automatic code generator of some kind. I'm just not sure it should take the Swig approach of parsing C++ headers. I think you could do better. Cheers, Dave P.S. By the way, if people want to know a lot more about Swig internals, they should check out the PyCon 2008 presentation I gave about it. http://www.dabeaz.com/SwigMaster/ From greg at krypto.org Mon Aug 29 14:51:42 2011 From: greg at krypto.org (Gregory P. 
Smith) Date: Mon, 29 Aug 2011 05:51:42 -0700 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: <20110829142015.5eb247dc@pitrou.net> References: <20110829142015.5eb247dc@pitrou.net> Message-ID: On Mon, Aug 29, 2011 at 5:20 AM, Antoine Pitrou wrote: > On Sun, 28 Aug 2011 09:43:33 -0700 > Guido van Rossum wrote: > > > > This sounds like a very interesting idea to pursue, even if it's late, > > and even if it's experimental, and even if it's possible to cause > > deadlocks (no news there). I propose that we offer a C API in Python > > 3.3 as well as an extension module that offers the proposed decorator. > > The C API could then be used to implement alternative APIs purely as > > extension modules (e.g. would a deadlock-detecting API be possible?). > > We could offer the C API without shipping an extension module ourselves. > I don't think we should provide (and maintain!) a Python API that helps > users put themselves in all kind of nasty situations. There is enough > misunderstanding around the GIL and multithreading already. > +1 -------------- next part -------------- An HTML attachment was scrubbed... URL: From arigo at tunes.org Mon Aug 29 14:57:12 2011 From: arigo at tunes.org (Armin Rigo) Date: Mon, 29 Aug 2011 14:57:12 +0200 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: References: Message-ID: Hi Charles-Fran?ois, 2011/8/27 Charles-Fran?ois Natali : > The problem is that many locks are actually acquired implicitely. > For example, `print` to a buffered stream will acquire the fileobject's mutex. Indeed. After looking more at the kind of locks used throughout the stdlib, I notice that in many cases a lock is acquired by code in the following simple pattern: Py_BEGIN_ALLOW_THREADS PyThread_acquire_lock(self->lock, 1); Py_END_ALLOW_THREADS If one thread is waiting in the END_ALLOW_THREADS for another one to release the GIL, but the other one is in a "with atomic" block and tries to acquire the same lock, deadlock. But the issue can be resolved: the first thread in the above example needs to notice that the other thread is in a "with atomic" block, and "be nice" and release the lock again. Then it waits until the "with atomic" block finishes, and tries again from the start. We could do this by putting the above pattern it own function (which makes some sense anyway, because the pattern is repeated left and right, and is often complicated by an additional "if (!PyThread_acquire_lock(self->lock, 0))" before); and then allowing that function to be overridden by the external 'stm' module. I suspect that I need to do a more thorough review of the stdlib to make sure (at least more than now) that all potential deadlocking places can be avoided with a similar refactoring. All in all, it seems that the patch to CPython itself will need to be more than just the few lines in ceval.c --- but still very reasonable both in size and in content. A bient?t, Armin. From barry at python.org Mon Aug 29 15:00:56 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 29 Aug 2011 09:00:56 -0400 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <4E59255E.6000905@v.loewis.de> References: <4E582432.2080301@v.loewis.de> <4E588877.3080204@v.loewis.de> <20110827121012.37b39947@pitrou.net> <4E59255E.6000905@v.loewis.de> Message-ID: <20110829090056.03f719ad@resist.wooz.org> On Aug 27, 2011, at 07:11 PM, Martin v. L?wis wrote: >A PEP should IMO only cover end-user aspects of the new re module. 
>Code organization is typically not in the PEP. To give a specific >example: you mentioned that there is (near) code duplication >MRAB's module. As a reviewer, I would discuss whether this can be >eliminated - but not in the PEP. +1 -Barry From benjamin at python.org Mon Aug 29 15:20:56 2011 From: benjamin at python.org (Benjamin Peterson) Date: Mon, 29 Aug 2011 09:20:56 -0400 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: <9FA8683B-FB0A-4F46-878F-11B36F92A342@twistedmatrix.com> References: <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> <9FA8683B-FB0A-4F46-878F-11B36F92A342@twistedmatrix.com> Message-ID: 2011/8/29 Glyph Lefkowitz : > > On Aug 28, 2011, at 7:27 PM, Guido van Rossum wrote: > > In general, an existing library cannot be called > without access to its .h files -- there are probably struct and > constant definitions, platform-specific #ifdefs and #defines, and > other things in there that affect the linker-level calling conventions > for the functions in the library. > > Unfortunately I don't know a lot about this, but I keep hearing about > something called "rffi" that PyPy uses to call C from RPython: > . ?This has some > shortcomings currently, most notably the fact that it needs those .h files > (and therefore a C compiler) at runtime This is incorrect. rffi is actually quite like ctypes. The part you are referring to is probably rffi_platform [1], which invokes the compiler to determine constant values and struct offsets, or ctypes_configure, which does need runtime headers [2]. [1] https://bitbucket.org/pypy/pypy/src/92e36ab4eb5e/pypy/rpython/tool/rffi_platform.py [2] https://bitbucket.org/pypy/pypy/src/92e36ab4eb5e/ctypes_configure/ -- Regards, Benjamin From barry at python.org Mon Aug 29 15:33:05 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 29 Aug 2011 09:33:05 -0400 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: References: <20110827020835.08a2a492@pitrou.net> Message-ID: <20110829093305.256a6e6b@resist.wooz.org> On Aug 26, 2011, at 05:25 PM, Dan Stromberg wrote: >from __future__ import is an established way of trying something for a while >to see if it's going to work. Actually, no. The documentation says: -----snip snip----- __future__ is a real module, and serves three purposes: * To avoid confusing existing tools that analyze import statements and expect to find the modules they?re importing. * To ensure that future statements run under releases prior to 2.1 at least yield runtime exceptions (the import of __future__ will fail, because there was no module of that name prior to 2.1). * To document when incompatible changes were introduced, and when they will be ? or were ? made mandatory. This is a form of executable documentation, and can be inspected programmatically via importing __future__ and examining its contents. -----snip snip----- So, really the __future__ module is a way to introduce accepted but incompatible changes in a controlled way, through successive releases. It's never been used to introduce experimental features that might be removed if they don't work out. Cheers, -Barry From barry at python.org Mon Aug 29 15:41:15 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 29 Aug 2011 09:41:15 -0400 Subject: [Python-Dev] Should we move to replace re with regex? 
In-Reply-To: <87fwknv92x.fsf@benfinney.id.au> References: <4E581986.3000709@egenix.com> <4E58258F.9050204@egenix.com> <87mxevvea5.fsf@benfinney.id.au> <4E584E52.1080606@pearwood.info> <87fwknv92x.fsf@benfinney.id.au> Message-ID: <20110829094115.4721b7f9@resist.wooz.org> On Aug 27, 2011, at 01:15 PM, Ben Finney wrote: >My question is directed more to M-A Lemburg's passage above, and its >implicit assumption that the user understand the changes between >?Unicode 2.0/3.0 semantics? and ?Unicode 6 semantics?, and how their own >needs relate to those semantics. More likely, it'll be a choice between wanting Unicode 6 semantics, and "don't care". So the PEP could include some clues as to why you'd care to use regex instead of re. -Barry From greg at krypto.org Mon Aug 29 15:44:27 2011 From: greg at krypto.org (Gregory P. Smith) Date: Mon, 29 Aug 2011 06:44:27 -0700 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> Message-ID: On Sat, Aug 27, 2011 at 2:59 AM, Ask Solem wrote: > > On 26 Aug 2011, at 16:53, Antoine Pitrou wrote: > > > > > Hi, > > > >> I think that "deprecating" the use of threads w/ multiprocessing - or > >> at least crippling it is the wrong answer. Multiprocessing needs the > >> helper threads it uses internally to manage queues, etc. Removing that > >> ability would require a near-total rewrite, which is just a > >> non-starter. > > > > I agree that this wouldn't actually benefit anyone. > > Besides, I don't think it's even possible to avoid threads in > > multiprocessing, given the various constraints. We would have to force > > the user to run their main thread in an event loop, and that would be > > twisted (tm). > > > >> I would focus on the atfork() patch more directly, ignoring > >> multiprocessing in the discussion, and focusing on the merits of gps' > >> initial proposal and patch. > > > > I think this could also be combined with Charles-Fran?ois' patch. > > > > Regards > > > > Have to agree with Jesse and Antoine here. > > Celery (celeryproject.org) uses multiprocessing, is wildly used in > production, > and is regarded as stable software that have been known to run for months > at a time > only to be restarted for software upgrades. > > I have been investigating an issue for some time, that I'm pretty sure is > caused > by this. It occurs only rarely, so rarely I have not had any actual bug > reports > about it, it's just something I have experienced during extensive testing. > The tone of the discussion on the bug tracker makes me think that I have > been very lucky :-) > > Using the fork+exec approach seems like a much more realistic solution > than rewriting multiprocessing.Pool and Manager to not use threads. In fact > this is something I have been considering as a fix for the suspected > issue for for some time. > It does have implications that are annoying for sure, but we are already > used to this on the Windows platform (it could help portability even). > +3 (agreed to Jesse, Antoine and Ask here). The http://bugs.python.org/issue8713 described "non-fork" implementation that always uses subprocesses rather than plain forked processes is the right way forward for multiprocessing. -gps -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stefan_ml at behnel.de Mon Aug 29 16:14:53 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 29 Aug 2011 16:14:53 +0200 Subject: [Python-Dev] Cython, ctypes and the stdlib In-Reply-To: References: <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: Hi, I agree that this is getting off-topic for this list. I'm answering here in a certain detail to lighten things up a bit regarding thin and thick wrappers, but please move further usage related questions to the cython-users mailing list. Paul Moore, 29.08.2011 12:37: > On 29 August 2011 10:39, Stefan Behnel wrote: >> In the CPython backend, the header files are normally #included by the >> generated C code, so they are used at C compilation time. >> >> Cython has its own view on the header files in separate declaration files >> (.pxd). Basically looks like this: >> >> # file "mymath.pxd" >> cdef extern from "aheader.h": >> double PI >> double E >> double abs(double x) >> >> These declaration files usually only contain the parts of a header file that >> are used in the user code, either manually copied over or extracted by >> scripts (that's what I was referring to in my reply to Terry). The complete >> 'real' content of the header file is then used by the C compiler at C >> compilation time. >> >> The user code employs a "cimport" statement to import the declarations at >> Cython compilation time, e.g. >> >> # file "mymodule.pyx" >> cimport mymath >> print mymath.PI + mymath.E >> >> would result in C code that #includes "aheader.h", adds the C constants "PI" >> and "E", converts the result to a Python float object and prints it out >> using the normal CPython machinery. > > One thing that would make it easier for me to understand the role of > Cython in this context would be to see a simple example of the type of > "thin wrapper" we're talking about here. The above code is nearly > this, but the pyx file executes "real code". Yes, that's the idea. If all you want is an exact, thin wrapper, you are better off with SWIG (well, assuming that performance is not important for you - Cython is a *lot* faster). But if you use it, or any other plain glue code generator, chances are that you will quickly learn that you do not actually want a thin wrapper. Instead, you want something that makes the external library easily and efficiently usable from Python code. Which means that the wrapper will be thin in some places and thick in others, sometimes very thick in selected places, and usually growing thicker over time. You can do this by using a glue code generator and writing the rest in a Python wrapper on top of the thin glue code. It's just that Cython makes such a wrapper much more efficient (for CPython), be it in terms of CPU performance (fast Python interaction, overhead-free C interaction, native C data type support, various Python code optimisations), or in terms of parallelisation support (explicit GIL-free threading and OpenMP), or just general programmer efficiency, e.g. regarding automatic data conversion or ease and safety of manual C memory management. > For example, how do I simply expose pi and abs from math.h? Based on > the above, I tried a pyx file containing just the code > > cdef extern from "math.h": > double pi > double abs(double x) > > but the resulting module exported no symbols. Recent Cython versions have support for directly exporting C values (e.g. enum values) at the Python module level. 
However, the normal way is to explicitly implement the module API as you guessed, i.e. cimport mydecls # assuming there is a mydecls.pxd PI = mydecls.PI def abs(x): return mydecls.abs(x) Looks simple, right? Nothing interesting here, until you start putting actual code into it, as in this (totally contrived and untested, but much more correct) example: from libc cimport math cdef extern from *: # these are defined by the always included Python.h: long LONG_MAX, LONG_MIN def abs(x): if isinstance(x, float): # -> C double return math.fabs(x) elif isinstance(x, int): # -> may or may not be a C integer if LONG_MIN <= x <= LONG_MAX: return math.labs(x) else: # either within "long long" or raise OverflowError return math.llabs(x) else: # assume it can at least coerce to a C long, # or raise ValueError or OverflowError or whatever return math.labs(x) BTW, there is some simple templating/generics-like type merging support upcoming in a GSoC to simplify this kind of type specific code. > This is probably a bit off-topic, but it seems to me that whenever > Cython comes up in these discussions, the implications of > Cython-as-an-implementation-of-python obscure the idea of simply using > Cython as a means of writing thin library wrappers. Cython is not a glue code generator, it's a full-fledged programming language. It's Python, with additional support for C data types. That makes it great for writing non-trivial wrappers between Python and C. It's not so great for the trivial cases, but luckily, those are rare. ;) > I've kept python-dev in this response, on the assumption that others > on the list might be glad of seeing a concrete example of using Cython > to build wrapper code. But anything further should probably be taken > off-list... Agreed. The best place for asking about Cython usage is the cython-users mailing list. > PS This would also probably be a useful addition to the Cython wiki > and/or the manual. I searched both and found very little other than a > page on wrapping C++ classes (which is not very helpful for simple C > global functions and constants). Hmm, ok, I guess that's because it's too simple (you actually guessed how it works) and a somewhat rare use case. In most cases, wrappers tend to use extension types, as presented here: http://docs.cython.org/src/tutorial/clibraries.html Stefan From barry at python.org Mon Aug 29 18:24:20 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 29 Aug 2011 12:24:20 -0400 Subject: [Python-Dev] PEP categories (was Re: PEP 393 review) In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> Message-ID: <20110829122420.2d342f9c@resist.wooz.org> On Aug 29, 2011, at 11:03 AM, Dirkjan Ochtman wrote: >Also, this PEP makes me wonder if there should be a way to distinguish >between language PEPs and (CPython) implementation PEPs, by adding a >tag or using the PEP number ranges somehow. I've thought about this, and about a similar split between language changes and stdlib changes (i.e. new modules such as regex). Probably the best thing to do would be to allocate some 1000's to the different categories, like we did for the 3xxx Python 3k PEPS (now largely moot though). 
-Barry From neologix at free.fr Mon Aug 29 18:24:29 2011 From: neologix at free.fr (=?ISO-8859-1?Q?Charles=2DFran=E7ois_Natali?=) Date: Mon, 29 Aug 2011 18:24:29 +0200 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> Message-ID: > +3 (agreed to Jesse, Antoine and Ask here). > ?The?http://bugs.python.org/issue8713?described "non-fork" implementation > that always uses subprocesses rather than plain forked processes is the > right way forward for multiprocessing. I see two drawbacks: - it will be slower, since the interpreter startup time is non-negligible (well, normally you shouldn't spawn a new process for every item, but it should be noted) - it'll consume more memory, since we lose the COW advantage (even though it's already limited by the fact that even treating a variable read-only can trigger an incref, as was noted in a previous thread) cf From dirkjan at ochtman.nl Mon Aug 29 18:38:23 2011 From: dirkjan at ochtman.nl (Dirkjan Ochtman) Date: Mon, 29 Aug 2011 18:38:23 +0200 Subject: [Python-Dev] PEP categories (was Re: PEP 393 review) In-Reply-To: <20110829122420.2d342f9c@resist.wooz.org> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> <20110829122420.2d342f9c@resist.wooz.org> Message-ID: On Mon, Aug 29, 2011 at 18:24, Barry Warsaw wrote: >>Also, this PEP makes me wonder if there should be a way to distinguish >>between language PEPs and (CPython) implementation PEPs, by adding a >>tag or using the PEP number ranges somehow. > > I've thought about this, and about a similar split between language changes > and stdlib changes (i.e. new modules such as regex). ?Probably the best thing > to do would be to allocate some 1000's to the different categories, like we > did for the 3xxx Python 3k PEPS (now largely moot though). Allocating 1000's seems sensible enough to me. And yes, the division between recents 3x and non-3x PEPs seems quite arbitrary. Cheers, Dirkjan P.S. Perhaps the index could list accepted and open PEPs before meta and informational? And maybe reverse the order under some headings, for example in the finished category... From solipsis at pitrou.net Mon Aug 29 18:40:37 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 29 Aug 2011 18:40:37 +0200 Subject: [Python-Dev] PEP categories (was Re: PEP 393 review) References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> <20110829122420.2d342f9c@resist.wooz.org> Message-ID: <20110829184037.594359f0@pitrou.net> On Mon, 29 Aug 2011 18:38:23 +0200 Dirkjan Ochtman wrote: > On Mon, Aug 29, 2011 at 18:24, Barry Warsaw wrote: > >>Also, this PEP makes me wonder if there should be a way to distinguish > >>between language PEPs and (CPython) implementation PEPs, by adding a > >>tag or using the PEP number ranges somehow. > > > > I've thought about this, and about a similar split between language changes > > and stdlib changes (i.e. new modules such as regex). ?Probably the best thing > > to do would be to allocate some 1000's to the different categories, like we > > did for the 3xxx Python 3k PEPS (now largely moot though). > > Allocating 1000's seems sensible enough to me. 
> > And yes, the division between recents 3x and non-3x PEPs seems quite arbitrary. I like the 3k numbers myself :)) From stefan_ml at behnel.de Mon Aug 29 18:55:00 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 29 Aug 2011 18:55:00 +0200 Subject: [Python-Dev] PEP categories (was Re: PEP 393 review) In-Reply-To: <20110829122420.2d342f9c@resist.wooz.org> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> <20110829122420.2d342f9c@resist.wooz.org> Message-ID: Barry Warsaw, 29.08.2011 18:24: > On Aug 29, 2011, at 11:03 AM, Dirkjan Ochtman wrote: > >> Also, this PEP makes me wonder if there should be a way to distinguish >> between language PEPs and (CPython) implementation PEPs, by adding a >> tag or using the PEP number ranges somehow. > > I've thought about this, and about a similar split between language changes > and stdlib changes (i.e. new modules such as regex). Probably the best thing > to do would be to allocate some 1000's to the different categories, like we > did for the 3xxx Python 3k PEPS (now largely moot though). These things tend to get somewhat clumsy over time, though. What about a stdlib change that only applies to CPython for some reason, e.g. because no other implementation currently has that module? I think it's ok to make a coarse-grained distinction by numbers, but there should also be a way to tag PEPs textually. Stefan From jnoller at gmail.com Mon Aug 29 19:03:53 2011 From: jnoller at gmail.com (Jesse Noller) Date: Mon, 29 Aug 2011 13:03:53 -0400 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> Message-ID: 2011/8/29 Charles-Fran?ois Natali : >> +3 (agreed to Jesse, Antoine and Ask here). >> ?The?http://bugs.python.org/issue8713?described "non-fork" implementation >> that always uses subprocesses rather than plain forked processes is the >> right way forward for multiprocessing. > > I see two drawbacks: > - it will be slower, since the interpreter startup time is > non-negligible (well, normally you shouldn't spawn a new process for > every item, but it should be noted) Yes; but spawning and forking are both slow to begin with - it's documented (I hope heavily enough) that you should spawn multiprocessing children early, and keep them around instead of constantly creating/destroying them. > - it'll consume more memory, since we lose the COW advantage (even > though it's already limited by the fact that even treating a variable > read-only can trigger an incref, as was noted in a previous thread) > > cf Yes, it would consume slightly more memory; but the benefits - making it consistent across *all* platforms with the *same* restrictions gets us closer to the principle of least surprise. From eliben at gmail.com Mon Aug 29 19:14:13 2011 From: eliben at gmail.com (Eli Bendersky) Date: Mon, 29 Aug 2011 20:14:13 +0300 Subject: [Python-Dev] SWIG (was Re: Ctypes and the stdlib) In-Reply-To: References: Message-ID: > I've sometimes thought it might be interesting to create a Swig replacement > purely in Python. When I work on the PLY project, this is often what I > think about. In that project, I've actually built a number of the parsing > tools that would be useful in creating such a thing. 
The only catch is > that when I start thinking along these lines, I usually reach a point where > I say "nah, I'll just write the whole application in Python." > > Anyways, this is probably way more than anyone wants to know about Swig. > Getting back to the original topic of using it to make standard library > modules, I just don't know. I think you probably could have some success > with an automatic code generator of some kind. I'm just not sure it should > take the Swig approach of parsing C++ headers. I think you could do better. > > Dave, Having written a full C99 parser (http://code.google.com/p/pycparser/) based on your (excellent) PLY library, my impression is that the problem is with the problem, not with the solution. Strange sentence, I know :-) What I mean is that parsing C++ (even its headers) is inherently hard, which is why the solutions tend to grow so complex. Even with the modest C99, clean and simple solutions based on theoretical approaches (like PLY with its generated LALR parsers) tend to run into walls [*]. C++ is an order of magnitude harder. If I went to implement something like SWIG today, I would almost surely base my implementation on Clang (http://clang.llvm.org/). They have a full C++ parser (carefully hand-crafted, quite admirably keeping a relatively comprehensible code-base for such a task) used in a real compiler front-end, and a flexible library structure aimed at creating tools. There are also Python bindings that would allow to do most of the interesting Python-interface-specific work in Python - parse the C++ headers using Clang's existing parser into ASTs - then generate ctypes / extensions from that, *in Python*. The community is also gladly accepting contributions. I've had some fixes committed for the Python bindings and the C interfaces that tie them to Clang, and got the impression from Clang's core devs that further contributions will be most welcome. So whatever is missing from the Python bindings can be easily added. Eli [*] http://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-c%E2%80%99s-grammar-revisited/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Mon Aug 29 19:16:08 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 29 Aug 2011 19:16:08 +0200 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> Message-ID: <20110829191608.7916da73@pitrou.net> On Mon, 29 Aug 2011 13:03:53 -0400 Jesse Noller wrote: > 2011/8/29 Charles-Fran?ois Natali : > >> +3 (agreed to Jesse, Antoine and Ask here). > >> ?The?http://bugs.python.org/issue8713?described "non-fork" implementation > >> that always uses subprocesses rather than plain forked processes is the > >> right way forward for multiprocessing. > > > > I see two drawbacks: > > - it will be slower, since the interpreter startup time is > > non-negligible (well, normally you shouldn't spawn a new process for > > every item, but it should be noted) > > Yes; but spawning and forking are both slow to begin with - it's > documented (I hope heavily enough) that you should spawn > multiprocessing children early, and keep them around instead of > constantly creating/destroying them. I think fork() is quite fast on modern systems (e.g. Linux). exec() is certainly slow, though. 
The third drawback is that you are limited to picklable objects when specifying the arguments for your child process. This can be annoying if, for example, you wanted to pass an OS resource. Regards Antoine. From jnoller at gmail.com Mon Aug 29 19:23:20 2011 From: jnoller at gmail.com (Jesse Noller) Date: Mon, 29 Aug 2011 13:23:20 -0400 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: <20110829191608.7916da73@pitrou.net> References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> <20110829191608.7916da73@pitrou.net> Message-ID: On Mon, Aug 29, 2011 at 1:16 PM, Antoine Pitrou wrote: > On Mon, 29 Aug 2011 13:03:53 -0400 > Jesse Noller wrote: >> 2011/8/29 Charles-Fran?ois Natali : >> >> +3 (agreed to Jesse, Antoine and Ask here). >> >> ?The?http://bugs.python.org/issue8713?described "non-fork" implementation >> >> that always uses subprocesses rather than plain forked processes is the >> >> right way forward for multiprocessing. >> > >> > I see two drawbacks: >> > - it will be slower, since the interpreter startup time is >> > non-negligible (well, normally you shouldn't spawn a new process for >> > every item, but it should be noted) >> >> Yes; but spawning and forking are both slow to begin with - it's >> documented (I hope heavily enough) that you should spawn >> multiprocessing children early, and keep them around instead of >> constantly creating/destroying them. > > I think fork() is quite fast on modern systems (e.g. Linux). exec() is > certainly slow, though. > > The third drawback is that you are limited to picklable objects when > specifying the arguments for your child process. This can be annoying > if, for example, you wanted to pass an OS resource. > > Regards > > Antoine. Yes, it is annoying; but again - this makes it more consistent with the windows implementation. I'd rather that restriction than the "sanitization" of the ability to use threading and multiprocessing alongside one another. From solipsis at pitrou.net Mon Aug 29 19:22:53 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 29 Aug 2011 19:22:53 +0200 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> <20110829191608.7916da73@pitrou.net> Message-ID: <1314638573.3551.14.camel@localhost.localdomain> Le lundi 29 ao?t 2011 ? 13:23 -0400, Jesse Noller a ?crit : > > Yes, it is annoying; but again - this makes it more consistent with > the windows implementation. I'd rather that restriction than the > "sanitization" of the ability to use threading and multiprocessing > alongside one another. That sanitization is generally useful, though. For example if you want to use any I/O after a fork(). Regards Antoine. From s.brunthaler at uci.edu Mon Aug 29 19:35:14 2011 From: s.brunthaler at uci.edu (stefan brunthaler) Date: Mon, 29 Aug 2011 10:35:14 -0700 Subject: [Python-Dev] Python 3 optimizations continued... Message-ID: Hi, pretty much a year ago I wrote about the optimizations I did for my PhD thesis that target the Python 3 series interpreters. While I got some replies, the discussion never really picked up and no final explicit conclusion was reached. 
AFAICT, because of the following two factors, my optimizations were not that interesting for inclusion with the distribution at that time: a) Unladden Swallow was targeting Python 3, too. b) My prototype did not pass the regression tests. As of November 2010 (IIRC), Google is not supporting work on US anymore, and the project is stalled. (If I am wrong and there is still activity and any plans with the corresponding PEP, please let me know.) Which is why I recently spent some time fixing issues so that I can run the regression tests. There is still some work to be done, but by and large it should be possible to complete all regression tests in reasonable time (with the actual infrastructure in place, enabling optimizations later on is not a problem at all, too.) So, the two big issues aside, is there any interest in incorporating these optimizations in Python 3? Have a nice day, --stefan PS: It probably is unusual, but in a part of my home page I have created a link to indicate interest (makes both counting and voting easier, http://www.ics.uci.edu/~sbruntha/) There were also links indicating interest in funding the work; I have disabled these, so as not to upset anybody or make the impression of begging for money... From jnoller at gmail.com Mon Aug 29 19:42:02 2011 From: jnoller at gmail.com (Jesse Noller) Date: Mon, 29 Aug 2011 13:42:02 -0400 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: <1314638573.3551.14.camel@localhost.localdomain> References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> <20110829191608.7916da73@pitrou.net> <1314638573.3551.14.camel@localhost.localdomain> Message-ID: On Mon, Aug 29, 2011 at 1:22 PM, Antoine Pitrou wrote: > Le lundi 29 ao?t 2011 ? 13:23 -0400, Jesse Noller a ?crit : >> >> Yes, it is annoying; but again - this makes it more consistent with >> the windows implementation. I'd rather that restriction than the >> "sanitization" of the ability to use threading and multiprocessing >> alongside one another. > > That sanitization is generally useful, though. For example if you want > to use any I/O after a fork(). Oh! I don't disagree; I'm just against the removal of the ability to mix multiprocessing and threads; which it does internally and others do in every day code. The "proposed" removal of that functionality - using the two together - would leave users in the dust, and not needed if we patch http://bugs.python.org/issue8713 - which at it's core is just an addition flag. We could document the risk(s) of using the fork() mechanism which has to remain the default for some time. The point is, is that the solution to http://bugs.python.org/issue6721 should not be intertwined or cause a severe change in the multiprocessing module (e.g. "rewriting from scratch"), etc. I'm not arguing that both bugs should not be fixed. jesse From benjamin at python.org Mon Aug 29 20:10:12 2011 From: benjamin at python.org (Benjamin Peterson) Date: Mon, 29 Aug 2011 14:10:12 -0400 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: Message-ID: 2011/8/29 stefan brunthaler : > So, the two big issues aside, is there any interest in incorporating > these optimizations in Python 3? Perhaps there would be something to say given patches/overviews/specifics. 
-- Regards, Benjamin From nir at winpdb.org Mon Aug 29 20:29:11 2011 From: nir at winpdb.org (Nir Aides) Date: Mon, 29 Aug 2011 21:29:11 +0300 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: <20110829191608.7916da73@pitrou.net> References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> <20110829191608.7916da73@pitrou.net> Message-ID: On Mon, Aug 29, 2011 at 8:16 PM, Antoine Pitrou wrote: > > On Mon, 29 Aug 2011 13:03:53 -0400 Jesse Noller wrote: > > > > Yes; but spawning and forking are both slow to begin with - it's > > documented (I hope heavily enough) that you should spawn > > multiprocessing children early, and keep them around instead of > > constantly creating/destroying them. > > I think fork() is quite fast on modern systems (e.g. Linux). exec() is > certainly slow, though. On my system, the time it takes worker code to start is: 40 usec with thread.start_new_thread 240 usec with threading.Thread().start 450 usec with os.fork 1 ms with multiprocessing.Process.start 25 ms with subprocess.Popen to start a trivial script. so os.fork has similar latency to threading.Thread().start, while spawning is 100 times slower. From s.brunthaler at uci.edu Mon Aug 29 20:33:14 2011 From: s.brunthaler at uci.edu (stefan brunthaler) Date: Mon, 29 Aug 2011 11:33:14 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: Message-ID: > Perhaps there would be something to say given patches/overviews/specifics. > Currently I don't have patches, but for an overview and specifics, I can provide the following: * My optimizations basically rely on quickening to incorporate run-time information. * I use two separate instruction dispatch routines, and use profiling to switch from the regular Python 3 dispatch routine to an optimized one (the implementation is actually vice versa, but that is not important now) * The optimized dispatch routine has a changed instruction format (word-sized instead of bytecodes) that allows for regular instruction decoding (without the HAS_ARG-check) and inlinind of some objects in the instruction format on 64bit architectures. * I use inline-caching based on quickening (passes almost all regression tests [302 out of 307]), eliminate reference count operations using quickening (passes but has a memory leak), promote frequently accessed local variables to their dedicated instructions (passes), and cache LOAD_GLOBAL/LOAD_NAME objects in the instruction encoding when possible (I am working on this right now.) The changes I made can be summarized as: * I changed some header files to accommodate additional information (Python.h, ceval.h, code.h, frameobject.h, opcode.h, tupleobject.h) * I changed mostly abstract.c to incorporate runtime-type feedback. * All other changes target mostly ceval.c and all supplementary code is in a sub-directory named "opt" and all generated files in a sub-directory within that ("opt/gen"). * I have a code generator in place that takes care of generating all the functions; it uses the Mako template system for creating C code and does not necessarily need to be shipped with the interpreter (though one can play around and experiment with it.) So, all in all, the changes are not that big to the actual implementation, and most of the code is generated (using sloccount, opt has 1990 lines of C, and opt/gen has 8649 lines of C). 
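For people who have not come across the term before, "quickening" simply means
rewriting an instruction in place once runtime feedback for it is available. A
toy Python illustration of the idea (all names are invented here, and the real
implementation of course patches the word-coded instruction stream in C, not a
Python list):

    def run(code, a, b):
        pc, result = 0, None
        while True:
            op = code[pc]
            if op == "ADD":
                result = a + b                 # generic path: full type dispatch
                if isinstance(a, int) and isinstance(b, int):
                    code[pc] = "ADD_INT"       # quicken: patch the instruction in place
                pc += 1
            elif op == "ADD_INT":
                result = a + b                 # specialized path on later executions
                pc += 1
            elif op == "HALT":
                return result

    prog = ["ADD", "HALT"]
    print(run(prog, 2, 3))   # first run executes the generic instruction and quickens it
    print(prog)              # ['ADD_INT', 'HALT']
    print(run(prog, 4, 5))   # later runs go straight to the specialized instruction

The inline caching and reference-count elimination mentioned above are
applications of the same trick: observe once, rewrite the instruction, and let
subsequent executions take the short path.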
That's a quick summary, if there are any further or more in-depth questions, let me know. best, --stefan From cournape at gmail.com Mon Aug 29 20:37:31 2011 From: cournape at gmail.com (David Cournapeau) Date: Mon, 29 Aug 2011 20:37:31 +0200 Subject: [Python-Dev] SWIG (was Re: Ctypes and the stdlib) In-Reply-To: References: Message-ID: On Mon, Aug 29, 2011 at 7:14 PM, Eli Bendersky wrote: > >> >> I've sometimes thought it might be interesting to create a Swig >> replacement purely in Python. ?When I work on the PLY project, this is often >> what I think about. ? In that project, I've actually built a number of the >> parsing tools that would be useful in creating such a thing. ? The only >> catch is that when I start thinking along these lines, I usually reach a >> point where I say "nah, I'll just write the whole application in Python." >> >> Anyways, this is probably way more than anyone wants to know about Swig. >> Getting back to the original topic of using it to make standard library >> modules, I just don't know. ? I think you probably could have some success >> with an automatic code generator of some kind. ?I'm just not sure it should >> take the Swig approach of parsing C++ headers. ?I think you could do better. >> > > Dave, > > Having written a full C99 parser (http://code.google.com/p/pycparser/) based > on your (excellent) PLY library, my impression is that the problem is with > the problem, not with the solution. Strange sentence, I know :-) What I mean > is that parsing C++ (even its headers) is inherently hard, which is why the > solutions tend to grow so complex. Even with the modest C99, clean and > simple solutions based on theoretical approaches (like PLY with its > generated LALR parsers) tend to run into walls [*]. C++ is an order of > magnitude harder. > > If I went to implement something like SWIG today, I would almost surely base > my implementation on Clang (http://clang.llvm.org/). They have a full C++ > parser (carefully hand-crafted, quite admirably keeping a relatively > comprehensible code-base for such a task) used in a real compiler front-end, > and a flexible library structure aimed at creating tools. There are also > Python bindings that would allow to do most of the interesting > Python-interface-specific work in Python - parse the C++ headers using > Clang's existing parser into ASTs - then generate ctypes / extensions from > that, *in Python*. > > The community is also gladly accepting contributions. I've had some fixes > committed for the Python bindings and the C interfaces that tie them to > Clang, and got the impression from Clang's core devs that further > contributions will be most welcome. So whatever is missing from the Python > bindings can be easily added. Agreed, I know some people have looked into that direction in the scientific python community (to generate .pxd for cython). I wrote one of the hack Stefan refered to (based on ctypeslib using gccxml), and using clang makes so much more sense. To go back to the initial issue, using cython to wrap C code makes a lot of sense. In the scipy community, I believe there is a broad agreement that most of code which would requires C/C++ should be done in cython instead (numpy and scipy already do so a bit). I personally cannot see man situations where writing wrappers in C by hand works better than cython (especially since cython handles python2/3 automatically for you). 
cheers, David From barry at python.org Mon Aug 29 20:59:02 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 29 Aug 2011 14:59:02 -0400 Subject: [Python-Dev] PEP categories (was Re: PEP 393 review) In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> <20110829122420.2d342f9c@resist.wooz.org> Message-ID: <20110829145902.4774d0fc@resist.wooz.org> On Aug 29, 2011, at 06:55 PM, Stefan Behnel wrote: >These things tend to get somewhat clumsy over time, though. What about a >stdlib change that only applies to CPython for some reason, e.g. because no >other implementation currently has that module? I think it's ok to make a >coarse-grained distinction by numbers, but there should also be a way to tag >PEPs textually. Yeah, the categories would be pretty coarse grained, and their orthogonality would cause classification problems. I suppose we could use some kind of hashtag approach. OTOH, I'm not entirely sure it's worth it either. ;) I think we'd need a concrete proposal and someone willing to hack the PEP0 autogen tools. -Barry From barry at python.org Mon Aug 29 21:00:07 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 29 Aug 2011 15:00:07 -0400 Subject: [Python-Dev] PEP categories (was Re: PEP 393 review) In-Reply-To: <20110829184037.594359f0@pitrou.net> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> <20110829122420.2d342f9c@resist.wooz.org> <20110829184037.594359f0@pitrou.net> Message-ID: <20110829150007.72460089@resist.wooz.org> On Aug 29, 2011, at 06:40 PM, Antoine Pitrou wrote: >I like the 3k numbers myself :)) Me too. :) But I think we've pretty much abandoned that convention for any new PEPs. Well, until Guido announces Python 4k. :) -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From nadeem.vawda at gmail.com Mon Aug 29 21:04:38 2011 From: nadeem.vawda at gmail.com (Nadeem Vawda) Date: Mon, 29 Aug 2011 21:04:38 +0200 Subject: [Python-Dev] LZMA compression support in 3.3 In-Reply-To: <20110829083029.68faa57b@resist.wooz.org> References: <4E59041A.7040100@v.loewis.de> <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <20110829083029.68faa57b@resist.wooz.org> Message-ID: I've updated the issue with a patch containing my work so far - the LZMACompressor and LZMADecompressor classes, along with some tests. These two classes should provide a fairly complete interface to liblzma; it will be possible to implement LZMAFile on top of them, entirely in Python. Note that the C code does no I/O; this will be handled by LZMAFile. Please take a look, and let me know what you think. Cheers, Nadeem From martin at v.loewis.de Mon Aug 29 21:20:53 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Mon, 29 Aug 2011 21:20:53 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> Message-ID: <4E5BE695.2070203@v.loewis.de> Am 29.08.2011 11:03, schrieb Dirkjan Ochtman: > On Sun, Aug 28, 2011 at 21:47, "Martin v. L?wis" wrote: >> result strings. 
In PEP 393, a buffer must be scanned for the >> highest code point, which means that each byte must be inspected >> twice (a second time when the copying occurs). > > This may be a silly question: are there things in place to optimize > this for the case where two strings are combined? E.g. highest > character in combined string is max(highest character in either of the > strings). Unicode_Concat goes like this maxchar = PyUnicode_MAX_CHAR_VALUE(u); if (PyUnicode_MAX_CHAR_VALUE(v) > maxchar) maxchar = PyUnicode_MAX_CHAR_VALUE(v); /* Concat the two Unicode strings */ w = (PyUnicodeObject *) PyUnicode_New( PyUnicode_GET_LENGTH(u) + PyUnicode_GET_LENGTH(v), maxchar); if (w == NULL) goto onError; PyUnicode_CopyCharacters(w, 0, u, 0, PyUnicode_GET_LENGTH(u)); PyUnicode_CopyCharacters(w, PyUnicode_GET_LENGTH(u), v, 0, PyUnicode_GET_LENGTH(v)); > Also, this PEP makes me wonder if there should be a way to distinguish > between language PEPs and (CPython) implementation PEPs, by adding a > tag or using the PEP number ranges somehow. Well, no. This would equally apply to every single patch, and is just not feasible. Instead, alternative implementations typically target a CPython version, and then find out what features they need to implement to claim conformance. Regards, Martin From tjreedy at udel.edu Mon Aug 29 21:24:12 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 29 Aug 2011 15:24:12 -0400 Subject: [Python-Dev] Should we move to replace re with regex? In-Reply-To: <20110829090056.03f719ad@resist.wooz.org> References: <4E582432.2080301@v.loewis.de> <4E588877.3080204@v.loewis.de> <20110827121012.37b39947@pitrou.net> <4E59255E.6000905@v.loewis.de> <20110829090056.03f719ad@resist.wooz.org> Message-ID: On 8/29/2011 9:00 AM, Barry Warsaw wrote: > On Aug 27, 2011, at 07:11 PM, Martin v. L?wis wrote: > >> A PEP should IMO only cover end-user aspects of the new re module. >> Code organization is typically not in the PEP. To give a specific >> example: you mentioned that there is (near) code duplication >> MRAB's module. As a reviewer, I would discuss whether this can be >> eliminated - but not in the PEP. > > +1 I think at this point we need a tracker issue to which can be attached such reviews, for safe-keeping, even if most discussion continues here. -- Terry Jan Reedy From ndbecker2 at gmail.com Mon Aug 29 21:28:23 2011 From: ndbecker2 at gmail.com (Neal Becker) Date: Mon, 29 Aug 2011 15:28:23 -0400 Subject: [Python-Dev] SWIG (was Re: Ctypes and the stdlib) References: Message-ID: Then there is gccxml, although I'm not sure how active it is now. From martin at v.loewis.de Mon Aug 29 21:34:48 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 29 Aug 2011 21:34:48 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5B5364.9040100@haypocalc.com> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <20110825132734.1c236d17@pitrou.net> <4E5A9B39.8090009@v.loewis.de> <1314561666.3656.3.camel@localhost.localdomain> <4E5AADDA.5090206@v.loewis.de> <4E5B5364.9040100@haypocalc.com> Message-ID: <4E5BE9D8.5050309@v.loewis.de> >> Those haven't been ported to the new API, yet. Consider, for example, >> d9821affc9ee. Before that, I got 253 MB/s on the 4096 units read test; >> with that change, I get 610 MB/s. The trunk gives me 488 MB/s, so this >> is a 25% speedup for PEP 393. > > If I understand correctly, the performance now highly depend on the used > characters? 
A pure ASCII string is faster than a string with characters > in the ISO-8859-1 charset? How did you infer that from above paragraph??? ASCII and Latin-1 are mostly identical in terms of performance - the ASCII decoder should be slightly slower than the Latin-1 decoder, since the ASCII decoder needs to check for errors, whereas the Latin-1 decoder will never be confronted with errors. What matters is a) is the codec already rewritten to use the new representation, or must it go through Py_UNICODE[] first, requiring then a second copy to the canonical form? b) what is the cost of finding out the highest character? - regardless of what the highest character turns out to be > Is it also true for BMP characters vs non-BMP > characters? Well... If you are talking about the ASCII and Latin-1 codecs - neither of these support most BMP characters, let alone non-BMP characters. In general, non-BMP characters are more expensive to process since they take more space. > Do these benchmark tools use only ASCII characters, or also some > ISO-8859-1 characters? See for yourself. iobench uses Latin-1, including non-ASCII, but not non-Latin-1. > Or, better, different Unicode ranges in different tests? That's why I asked for a list of benchmarks to perform. I cannot run an infinite number of benchmarks prior to adoption of the PEP. Regards, Martin From nir at winpdb.org Mon Aug 29 21:41:27 2011 From: nir at winpdb.org (Nir Aides) Date: Mon, 29 Aug 2011 22:41:27 +0300 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> <20110829191608.7916da73@pitrou.net> <1314638573.3551.14.camel@localhost.localdomain> Message-ID: On Mon, Aug 29, 2011 at 8:42 PM, Jesse Noller wrote: > On Mon, Aug 29, 2011 at 1:22 PM, Antoine Pitrou wrote: >> >> That sanitization is generally useful, though. For example if you want >> to use any I/O after a fork(). > > Oh! I don't disagree; I'm just against the removal of the ability to > mix multiprocessing and threads; which it does internally and others > do in every day code. I am not familiar with the python-dev definition for deprecation, but when I used the word in the bug discussion I meant to advertize to users that they should not mix threading and forking since that mix is and will remain broken by design; I did not mean removal or crippling of functionality. ?When I use a word,? Humpty Dumpty said, in rather a scornful tone, ?it means just what I choose it to mean?neither more nor less.? - Through the Looking-Glass (btw, my tone is not scornful) And there is no way around it - the mix in general is broken, with an atfork mechanism or without it. People can choose to keep doing it in their every day code at their own risk, be it significantly high or insignificantly low. But the documentation should explain the problem clearly. As for the internal use of threads in the multiprocessing module I proposed a potential way to "sanitize" those particular worker threads: http://bugs.python.org/issue6721#msg140402 If it makes sense and entails changes to internal multiprocessing worker threads, those changes could be applied as bug fixes to Python 2.x and previous Python 3.x releases. This does not contradict adding now the feature to spawn, and to make it the only possibility in the future. I agree that this is the "saner" approach but it is a new feature not a bug fix. 
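To make the "broken by design" point concrete, here is a deliberately contrived sketch (POSIX only, and it hangs on purpose): a lock held by another thread at fork() time is copied into the child in the locked state, and no thread exists in the child to ever release it.

    import os, threading, time

    lock = threading.Lock()

    def hold_lock():
        with lock:
            time.sleep(5)          # hold the lock while the main thread forks

    threading.Thread(target=hold_lock).start()
    time.sleep(0.1)                # give the worker time to grab the lock

    pid = os.fork()
    if pid == 0:
        # Only the forking thread exists in the child, but the copied lock
        # is still marked as held, so this acquire never returns.
        lock.acquire()
        os._exit(0)
    os.waitpid(pid, 0)

An atfork hook can re-initialize the locks it knows about, but it cannot know about every lock in every library and extension module, which is why the mix stays unreliable in general.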
Nir From martin at v.loewis.de Mon Aug 29 22:32:01 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 29 Aug 2011 22:32:01 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <4E577589.4030809@v.loewis.de> Message-ID: <4E5BF741.50209@v.loewis.de> tl;dr: PEP-393 reduces the memory usage for strings of a very small Django app from 7.4MB to 4.4MB, all other objects taking about 1.9MB. Am 26.08.2011 16:55, schrieb Guido van Rossum: > It would be nice if someone wrote a test to roughly verify these > numbers, e.v. by allocating lots of strings of a certain size and > measuring the process size before and after (being careful to adjust > for the list or other data structure required to keep those objects > alive). I have now written a Django application to measure the effect of PEP 393, using the debug mode (to find all strings), and sys.getsizeof: https://bitbucket.org/t0rsten/pep-393/src/ad02e1b4cad9/pep393utils/djmemprof/count/views.py The results for 3.3 and pep-393 are attached. The Django app is small in every respect: trivial ORM, very few objects (just for the sake of exercising the ORM at all), no templating, short strings. The memory snapshot is taken in the middle of a request. The tests were run on a 64-bit Linux system with 32-bit Py_UNICODE. The tally of strings by length confirms that both tests have indeed comparable sets of objects (not surprising since it is identical Django source code and the identical application). Most strings in this benchmark are shorter than 16 characters, and a few have several thousand characters. The tally of byte lengths shows that it's the really long memory blocks that are gone with the PEP. Digging into the internal representation, it's possibly to estimate "unaccounted" bytes. For PEP 393: bytes - 80*strings - (chars+strings) = 190053 This is the total of the wchar_t and UTF-8 representations for objects that have them, plus any 2-byte and four-byte strings accounted incorrectly in above formula. Unfortunately, for "default" bytes + 56*strings - 4*(chars+strings) = 0 as unicode__sizeof__ doesn't account for the (separate) PyBytes object that may carry the default encoding. So in practice, the 3.3 number should be somewhat larger. In both cases, the app didn't cope for internal fragmentation; this would be possible by rounding up each string size to the next multiple of 8 (given that it's all allocated through the object allocator). It should be possible to squeeze a little bit out of the 190kB, by finding objects for which the wchar_t or UTF-8 representations are created unnecessarily. Regards, Martin -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 3k.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 393.txt URL: From martin at v.loewis.de Mon Aug 29 22:43:35 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 29 Aug 2011 22:43:35 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: Message-ID: <4E5BF9F7.9020608@v.loewis.de> > So, the two big issues aside, is there any interest in incorporating > these optimizations in Python 3? The question really is whether this is an all-or-nothing deal. If you could identify smaller parts that can be applied independently, interest would be higher. 
Also, I'd be curious whether your techniques help or hinder a potential integration of a JIT generator. Regards, Martin From mal at egenix.com Mon Aug 29 22:54:27 2011 From: mal at egenix.com (M.-A. Lemburg) Date: Mon, 29 Aug 2011 22:54:27 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5BF741.50209@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <4E577589.4030809@v.loewis.de> <4E5BF741.50209@v.loewis.de> Message-ID: <4E5BFC83.2020304@egenix.com> "Martin v. L?wis" wrote: > tl;dr: PEP-393 reduces the memory usage for strings of a very small > Django app from 7.4MB to 4.4MB, all other objects taking about 1.9MB. > > Am 26.08.2011 16:55, schrieb Guido van Rossum: >> It would be nice if someone wrote a test to roughly verify these >> numbers, e.v. by allocating lots of strings of a certain size and >> measuring the process size before and after (being careful to adjust >> for the list or other data structure required to keep those objects >> alive). > > I have now written a Django application to measure the effect of PEP > 393, using the debug mode (to find all strings), and sys.getsizeof: > > https://bitbucket.org/t0rsten/pep-393/src/ad02e1b4cad9/pep393utils/djmemprof/count/views.py > > The results for 3.3 and pep-393 are attached. > > The Django app is small in every respect: trivial ORM, very few > objects (just for the sake of exercising the ORM at all), > no templating, short strings. The memory snapshot is taken in > the middle of a request. > > The tests were run on a 64-bit Linux system with 32-bit Py_UNICODE. For comparison, could you run the test of the unmodified Python 3.3 on a 16-bit Py_UNICODE version as well ? Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 29 2011) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2011-10-04: PyCon DE 2011, Leipzig, Germany 36 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From solipsis at pitrou.net Mon Aug 29 22:54:13 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 29 Aug 2011 22:54:13 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5BF741.50209@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <4E577589.4030809@v.loewis.de> <4E5BF741.50209@v.loewis.de> Message-ID: <20110829225413.689d073c@pitrou.net> On Mon, 29 Aug 2011 22:32:01 +0200 "Martin v. L?wis" wrote: > I have now written a Django application to measure the effect of PEP > 393, using the debug mode (to find all strings), and sys.getsizeof: > > https://bitbucket.org/t0rsten/pep-393/src/ad02e1b4cad9/pep393utils/djmemprof/count/views.py > > The results for 3.3 and pep-393 are attached. This looks very nice. Is 3.3 a wide build? (how about a narrow build?) (is it with your own port of Django to py3k, or is there an official branch for it?) Regards Antoine. 
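(For anyone who wants to reproduce a rough version of such a tally without Django's debug mode, the following untested approximation walks the gc graph and sums sys.getsizeof(); it only sees strings reachable from gc-tracked containers, so it undercounts:

    import gc, sys

    seen = set()
    count = chars = total = 0
    for container in gc.get_objects():
        for obj in gc.get_referents(container):
            if type(obj) is str and id(obj) not in seen:
                seen.add(id(obj))
                count += 1
                chars += len(obj)
                total += sys.getsizeof(obj)
    print("strings: %d  chars: %d  bytes: %d" % (count, chars, total))

Running it before and after on the same workload at least gives comparable per-build numbers.)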
From s.brunthaler at uci.edu  Mon Aug 29 23:05:20 2011
From: s.brunthaler at uci.edu (stefan brunthaler)
Date: Mon, 29 Aug 2011 14:05:20 -0700
Subject: [Python-Dev] Python 3 optimizations continued...
In-Reply-To: <4E5BF9F7.9020608@v.loewis.de>
References: <4E5BF9F7.9020608@v.loewis.de>
Message-ID: 

> The question really is whether this is an all-or-nothing deal. If you
> could identify smaller parts that can be applied independently, interest
> would be higher.
>
Well, it's not an all-or-nothing deal. In my current architecture, I can selectively enable most of the optimizations as I see fit. The only pre-requisite (in my implementation) is that I have two dispatch loops with a changed instruction format. It is, however, not a technical necessity, just the way I implemented it. Basically, you can choose whatever you like best, and I could extract that part. I am just offering to add all the things that I have done :)

> Also, I'd be curious whether your techniques help or hinder a potential
> integration of a JIT generator.
>
This is something I have frequently discussed with several JIT people. IMHO, having my optimizations in place also helps a JIT compiler, since it can re-use the information I gathered to generate more aggressively optimized native machine code right away (the inline caches can be generated with the type information right away, some functions could be inlined with the guard statements subsumed, etc.) Another benefit could be that the JIT compiler can spend more time generating code, because the interpreter is already faster (so in some cases it would probably not make sense to include a non-optimizing fast and simple JIT compiler). There are others on the list who probably can (and want to) comment on this, too.

That aside, while having a JIT is an important goal, I can very well imagine scenarios where the additional memory consumption (for the generated native machine code) of a JIT for each process (I assume that the native machine code caches are not shared) hinders scalability. I have no data to back this up, but it would be an interesting trade-off: say, a 30% performance gain without substantial additional memory requirements on existing hardware, compared to higher achievable speedups that require more machines.

Regards,
--stefan

From solipsis at pitrou.net  Mon Aug 29 23:14:20 2011
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Mon, 29 Aug 2011 23:14:20 +0200
Subject: [Python-Dev] Python 3 optimizations continued...
References: 
Message-ID: <20110829231420.20c3516a@pitrou.net>

On Mon, 29 Aug 2011 11:33:14 -0700
stefan brunthaler wrote:
> * The optimized dispatch routine has a changed instruction format
> (word-sized instead of bytecodes) that allows for regular instruction
> decoding (without the HAS_ARG-check) and inlining of some objects in
> the instruction format on 64bit architectures.

Having a word-sized "bytecode" format would probably be acceptable in itself, so if you want to submit a patch for that, go ahead.

Regards

Antoine.
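(For context, the HAS_ARG check mentioned above comes from the current variable-width format: one byte of opcode, plus a two-byte argument only for opcodes at or above HAVE_ARGUMENT. In rough Python terms the decode looks like this; a sketch only, not the actual ceval.c loop:

    import opcode

    def decode(co_code):
        # Yield (opname, oparg) pairs from a code object's co_code bytes.
        i = 0
        while i < len(co_code):
            op = co_code[i]
            if op >= opcode.HAVE_ARGUMENT:        # the HAS_ARG check
                arg = co_code[i + 1] | (co_code[i + 2] << 8)
                i += 3
            else:
                arg = None
                i += 1
            yield opcode.opname[op], arg

With a word-sized format every instruction is fetched and decoded the same way, so that branch disappears from the hot loop.)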
From greg.ewing at canterbury.ac.nz Mon Aug 29 23:17:24 2011 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 30 Aug 2011 09:17:24 +1200 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: <4E5C01E4.2050106@canterbury.ac.nz> Guido van Rossum wrote: > (Just like Python's own .h files -- > e.g. the extensive renaming of the Unicode APIs depending on > narrow/wide build) How does Cython deal with these? Pyrex/Cython deal with it by generating C code that includes the relevant headers, so the C compiler expands all the macros, interprets the struct declarations, etc. All you need to do when writing the .pyx file is follow the same API that you would if you were writing C code to use the library. -- Greg From barry at python.org Mon Aug 29 23:18:33 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 29 Aug 2011 17:18:33 -0400 Subject: [Python-Dev] PEP 3151 from the BDFOP In-Reply-To: <20110824015756.51cdceac@pitrou.net> References: <20110823170357.3b3ab2fc@resist.wooz.org> <20110824015756.51cdceac@pitrou.net> Message-ID: <20110829171833.5e0cc40d@resist.wooz.org> On Aug 24, 2011, at 01:57 AM, Antoine Pitrou wrote: >> One guiding principle for me is that we should keep the abstraction as thin >> as possible. In particular, I'm concerned about mapping multiple errnos >> into a single Error. For example both EPIPE and ESHUTDOWN mapping to >> BrokePipeError, or EACESS or EPERM to PermissionError. I think we should >> resist this, so that one errno maps to exactly one Error. Where grouping >> is desired, Python already has mechanisms to deal with that, >> e.g. superclasses and multiple inheritance. Therefore, I think it would be >> better to have >> >> + FileSystemPermissionError >> + AccessError (EACCES) >> + PermissionError (EPERM) > >I'm not sure that's a good idea: Was it the specific grouping under FileSystemPermissionError that you're objecting to, or the "keep the abstraction thin" principle? Let's say we threw out the idea of FSPE superclass, would you still want to collapse EACCES and EPERM into PermissionError, or would separate exceptions for each be okay? It's still pretty easy to catch both in one except clause, and it won't be too annoying if it's rare. >Yes, FileSystemError might be removed. I thought that it would be >useful, in some library routines, to catch all filesystem-related >errors indistinctly, but it's not a complete catchall actually (for >example, AccessError is outside of the FileSystemError subtree). Reading your IRC message (sorry, I was afk) it sounds like you think FileSystemError can be removed. I like keeping the hierarchy flat. >> Similarly, I think it would be helpful to have the errno name (e.g. ENOENT) >> in the error message string. That way, it won't get in the way for most >> code, but would be usefully printed out for uncaught exceptions. > >Agreed, but I think that's a feature request quite orthogonal from the >PEP. The errno *number* is still printed as it was before: > >>>> open("foo") >Traceback (most recent call last): > File "", line 1, in >FileNotFoundError: [Errno 2] No such file or directory: 'foo' > >(see e.g. http://bugs.python.org/issue12762) True, but since you're going to be creating a bunch of new exception classes, it should be relatively painless to give them a better str. 
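Concretely, something along these lines is what I have in mind; just a sketch of the desired output, not anything that exists today:

    >>> open("foo")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileNotFoundError: [ENOENT] No such file or directory: 'foo'

The number would still be available on e.errno for code that wants it.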
Thanks for pointing out that bug; I agree with it. >> A second guiding principle should be that careful code that works in Python >> 3.2 must continue to work in Python 3.3 once PEP 3151 is accepted, but also >> for Python 2 code ported straight to Python 3.3. > >I don't porting straight to 3.3 would make a difference, especially now >that the idea of deprecating old exception names has been abandoned. Cool. >> Do be prepared for complaints about compatibility for careless code though >> - there's a ton of that out in the wild, and people will always complain >> with their "working" code breaks due to an upgrade. Be *very* explicit >> about this in the release notes and NEWS file, and put your asbestos >> underoos on. > >I'll take care about that :) :) >> Have you considered the impact of this PEP on other Python implementations? >> My hazy memory of Jython tells me that errnos don't really leak into Java >> and thus Jython much, but what about PyPy and IronPython? E.g. step 1's >> deprecation strategy seems pretty CPython-centric. > >Alternative implementations already have to implement errno codes in a >way or another if they want to have a chance of running existing code. >So I don't think the PEP makes much of a difference for them. >But their implementors can give their opinion on this. Let's give them a little more time to chime in (hopefully, they are reading this thread). We needn't wait too long though. >> As for step 1 (coalescing the errors). This makes sense and I'm generally >> agreeable, but I'm wondering whether it's best to re-use IOError for this >> rather than introduce a new exception. Not that I can think of a good name >> for that. I'm just not totally convinced that existing code when upgrading >> to Python 3.3 won't introduce silent failures. If an existing error is to >> be re-used for this, I'm torn on whether IOError or OSError is a better >> choice. Popularity aside, OSError *feels* more right. > >I don't have any personal preference. Previous discussions seemed to >indicate people preferred IOError. But changing the implementation to >OSError would be simple. I agree OSError feels slightly more right, as >in more generic. Thanks for making this change in the PEP. >> And that anything raising an exception (e.g. via PyErr_SetFromErrno) other >> than the new ones will raise IOError? > >I'm not sure I understand the question precisely. My question mostly was about raising OSError (as the current PEP states) with an errno that does *not* map to one of the new exceptions. In that case, I don't think there's anything you could raise other than exactly OSError, right? >The errno mapping mechanism is implemented in IOError.__new__, but it gets >called only if the class is exactly IOError, not a subclass: > >>>> IOError(errno.EPERM, "foo") >PermissionError(1, 'foo') >>>> class MyIOError(IOError): pass >... >>>> MyIOError(errno.EPERM, "foo") >MyIOError(1, 'foo') > >Using IOError.__new__ is the easiest way to ensure that all code >raising IO errors takes advantage of the errno mapping. Otherwise you >may get APIs raising the proper subclasses, and other APIs always >raising base IOError (it doesn't happen often, but some Python >library code raises an IOError with an explicit errno). > >> I also think that rather than transforming exception when raised from >> Python, i.e. via __new__ hackery, perhaps it should be a ValueError in its >> own right to raise IOError with an error represented by one of the >> subclasses. 
> >That would make it harder to keep compatibility while adding new >subclasses in future Python versions. Imagine a lot of people lobby for >a dedicated EBADF subclass and obtain it, then IOError(EBADF, "some >message") would suddenly raise a ValueError. Or do I misunderstand your >proposal? Somewhat. FWIW, this is the part that I'm most uncomfortable with. So, for raising OSError with an errno mapping to one of the subclasses, it appears to break the "explicit is better than implicit" principle, and I think it could lead to hard-to-debug or understand code. You'll look at code that raises OSError, but the exception that gets printed will be one of the subclasses. I'm afraid that if you don't know that this is happening, you're going to think you're going crazy. The other half is, let's say raising FileNotFoundError with the EEXIST errno. I'm guessing that the __init__'s for the new OSError subclasses will not have an `errno` attribute, so there's no way you can do that, but the PEP does not discuss this. It probably should. >> I found more examples of ECHILD and ESRCH than the >> former two. How'd you like to add those two to make your BDFOP happy? :) > >Wow, I didn't know ESRCH. >How would you call the respective exceptions? >- ChildProcessError for ECHILD? The Linux wait(2) manpage says: ECHILD (for wait()) The calling process does not have any unwaited-for children. ECHILD (for waitpid() or waitid()) The process specified by pid (wait? pid()) or idtype and id (waitid()) does not exist or is not a child of the calling process. (This can happen for one's own child if the action for SIGCHLD is set to SIG_IGN. See also the Linux Notes section about threads.) >- ProcessLookupError for ESRCH? The Linux kill(2) manpage says: ESRCH The pid or process group does not exist. Note that an existing process might be a zombie, a process which already committed termination, but has not yet been wait(2)ed for. So in a sense, both are lookup errors, though I think it's going too far to multiply inherit from LookupError. Maybe ChildWaitError or ChildLookupError for the former? ProcessLookupError seems good to me. >> What if all the errno symbolic names were mapped as attributes on IOError? >> The only advantage of that would be to eliminate the need to import errno, >> or for the ugly `e.errno == errno.ENOENT` stuff. That would then be >> rewritten as `e.errno == IOError.ENOENT`. A mild savings to be sure, but >> still. > >Hmm, I guess that's explorable as an orthogonal idea. Cool. How should we capture that? >> How dumb/useless/unworkable would it be to add an __future__ to switch from >> the old hierarchy to the new one? Probably pretty. ;) > >Well, the hierarchy is built-in, since it's about standard exceptions. >Also, you usually get the exception from some library API, so a >__future__ in your own module would not achieve much. > >> What about an api that applications/libraries could use to add additional >> exceptions based on other errnos they cared about? This could be consulted in >> PyErr_SetFromErrno() and raised instead of IOError. Okay, yeah, that's >> probably pretty dumb too. > >The problem is that behaviour becomes inconsistent accross libraries. >I'm not sure that's very helpful to the user. Yeah, on further reflection, let's forget those last two ideas. ;) Okay, so here's what's still outstanding for me: * Should we eliminate FileSystemError? (probably "yes") * Should we ensure one errno == one exception? - i.e. separate EACCES and EPERM - i.e. 
separate EPIPE and ESHUTDOWN * Should the str of the new exception subclasses be improved (e.g. to include the symbolic name instead of the errno first)? * Is the OSError.__new__() hackery a good idea? * Should the PEP define the signature of the new exceptions (e.g. to prohibit passing in an incorrect errno to an OSError subclass)? * Can we add ECHILD and ESRCH, and if so, what names should we use? * Where can we capture the idea of putting the symbolic names on OSError class attributes, or is it a dumb idea that should be ditched? * How long should we wait for other Python implementations to chime in? Cheers, -Barry From barry at python.org Mon Aug 29 23:21:05 2011 From: barry at python.org (Barry Warsaw) Date: Mon, 29 Aug 2011 17:21:05 -0400 Subject: [Python-Dev] PEP 3151 from the BDFOP In-Reply-To: References: <20110823170357.3b3ab2fc@resist.wooz.org> <20110824015756.51cdceac@pitrou.net> Message-ID: <20110829172105.6812cadd@resist.wooz.org> On Aug 24, 2011, at 12:51 PM, Nick Coghlan wrote: >On Wed, Aug 24, 2011 at 9:57 AM, Antoine Pitrou wrote: >> Using IOError.__new__ is the easiest way to ensure that all code >> raising IO errors takes advantage of the errno mapping. Otherwise you >> may get APIs raising the proper subclasses, and other APIs always >> raising base IOError (it doesn't happen often, but some Python >> library code raises an IOError with an explicit errno). > >It's also the natural place to put the errno->exception type mapping >so that existing code will raise the new errors without requiring >modification. We could spell it as a new class method ("from_errno" or >similar), but there isn't any ambiguity in doing it directly in >__new__, so a class method seems pointlessly inconvenient. As I mentioned, my main concern with this is the surprise factor for people debugging and reading the code. A class method would solve that, but looks uglier and doesn't work with existing code. -Barry From guido at python.org Mon Aug 29 23:20:49 2011 From: guido at python.org (Guido van Rossum) Date: Mon, 29 Aug 2011 14:20:49 -0700 Subject: [Python-Dev] SWIG (was Re: Ctypes and the stdlib) In-Reply-To: References: Message-ID: Thanks for an insightful post, Dave! I took the liberty of mentioning it on Google+: https://plus.google.com/115212051037621986145/posts/NyEiLEfR6HF (PS. Anyone wanting a G+ invite, go here: https://plus.google.com/i/7w3niYersIA:8fxDrfW-6TA ) --Guido On Mon, Aug 29, 2011 at 5:41 AM, David Beazley wrote: > On Mon, Aug 29, 2011 at 12:27 PM, Guido van Rossum wrote: > >> I wonder if for >> this particular purpose SWIG isn't the better match. (If SWIG weren't >> universally hated, even by its original author. :-) > > Hate is probably a strong word, but as the author of Swig, let me chime in here ;-). ? I think there are probably some lessons to be learned from Swig. > > As Nick noted, Swig is best suited when you have control over both sides (C/C++ and Python) of whatever code you're working with. ?In fact, the original motivation for ?Swig was to give application programmers (scientists in my case), a means for automatically generating the Python bindings to their code. ?However, there was one other important assumption--and that was the fact that all of your "real code" was going to be written in C/C++ and that the Python scripting interface was just an optional add-on (perhaps even just a throw-away thing). ?Keep in mind, Swig was first created in 1995 and at that time, the use of Python (or any similar language) was a pretty radical idea in the sciences. 
?Moreover, there was a lot of legacy code that people just weren't going to abandon. ?Thus, I always viewed Swig as a kind of transitional vehicle for getting people to use Python who might otherwise not even consider it. ? Getting back to Nick's point though, to really use Swig effectiv > ?ely, it was always known that you might have to reorganize or refactor your C/C++ code to make it more Python friendly. ?However, due to the automatic wrapper generation, you didn't have to do it all at once. ?Basically your code could organically evolve and Swig would just keep up with whatever you were doing. ?In my projects, we'd usually just tuck Swig away in some Makefile somewhere and forget about it. > > One of the major complexities of Swig is the fact that it attempts to parse C/C++ header files. ? This very notion is actually a dangerous trap waiting for anyone who wants to wander into it. ?You might look at a header file and say, well how hard could it be to just grab a few definitions out of there? ? I'll just write a few regexs or come up with some simple hack for recognizing function definitions or something. ? Yes, you can do that, but you're immediately going to find that whatever approach you take starts to break down into horrible corner cases. ? Swig started out like this and quickly turned into a quagmire of esoteric bug reports. ?All sorts of problems with preprocessor macros, typedefs, missing headers, and other things. ?For awhile, I would get these bug reports that would go something like "I had this C++ class inside a namespace with an abstract method taking a typedef'd const reference to this smart pointer ..... and Swig broke." ? Hell, I can't even underst > ?and the bug report let alone know how to fix it. ?Almost all of these bugs were due to the fact that Swig started out as a hack and didn't really have any kind of solid conceptual foundation for how it should be put together. > > If you flash forward a bit, from about 2001-2004 there was a very serious push to fix these kinds of issues. ?Although it was not a complete rewrite of Swig, there were a huge number of changes to how it worked during this time. ?Swig grew a fully compatible C++ preprocessor that fully supported macros ? A complete C++ type system was implemented including support for namespaces, templates, and even such things as template partial specialization. ?Swig evolved into a multi-pass compiler that was doing all sorts of global analysis of the interface. ? Just to give you an idea, Swig would do things such as automatically detect/wrap C++ smart pointers. ?It could wrap overloaded C++ methods/function. ?Also, if you had a C++ class with virtual methods, it would only make one Python wrapper function and then reuse across all wrapped subclasses. > > Under the covers of all of this, the implementation basically evolved into a sophisticated macro preprocessor coupled with a pattern matching engine built on top of the C++ type system. ? For example, you could write patterns that matched specific C++ types (the much hated "typemap" feature) and you could write patterns that matched entire C++ declarations. ?This whole pattern matching approach had a huge power if you knew what you were doing. ?For example, I had a graduate student working on adding "contracts" to Swig--something that was being funded by a NSF grant. ? It was cool and mind boggling all at once. > > In hindsight however, I think the complexity of Swig has exceeded anyone's ability to fully understand it (including my own). 
?For example, to even make sense of what's happening, you have to have a pretty solid grasp of the C/C++ type system (easier said than done). ? Couple that with all sorts of crazy pattern matching, low-level code fragments, and a ton of macro definitions, your head will literally explode if you try to figure out what's happening. ? So far as I know, recent versions of Swig have even combined all of this type-pattern matching with regular expressions. ?I can't even fathom it. > > Sadly, my involvement was Swig was an unfortunate casualty of my academic career biting the dust. ?By 2005, I was so burned out of working on it and so sick of what I was doing, I quite literally put all of my computer stuff aside to go play in a band for a few years. ? After a few years, I came back to programming (obviously), but not to keep working on the same stuff. ? In particularly, I will die quite happy if I never have to look at another line of C++ code ever again. ?No, I would much rather fling my toddlers, ride my bike, play piano, or do just about anything than ever do that again. ? Although I still subscribe the Swig mailing lists and watch what's happening, I'm not active with it at the moment. > > I've sometimes thought it might be interesting to create a Swig replacement purely in Python. ?When I work on the PLY project, this is often what I think about. ? In that project, I've actually built a number of the parsing tools that would be useful in creating such a thing. ? The only catch is that when I start thinking along these lines, I usually reach a point where I say "nah, I'll just write the whole application in Python." > > Anyways, this is probably way more than anyone wants to know about Swig. ? Getting back to the original topic of using it to make standard library modules, I just don't know. ? I think you probably could have some success with an automatic code generator of some kind. ?I'm just not sure it should take the Swig approach of parsing C++ headers. ?I think you could do better. > > Cheers, > Dave > > P.S. By the way, if people want to know a lot more about Swig internals, they should check out the PyCon 2008 presentation I gave about it. ?http://www.dabeaz.com/SwigMaster/ > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org > -- --Guido van Rossum (python.org/~guido) From solipsis at pitrou.net Mon Aug 29 23:39:33 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 29 Aug 2011 23:39:33 +0200 Subject: [Python-Dev] PEP 3151 from the BDFOP References: <20110823170357.3b3ab2fc@resist.wooz.org> <20110824015756.51cdceac@pitrou.net> <20110829171833.5e0cc40d@resist.wooz.org> Message-ID: <20110829233933.54c69a99@pitrou.net> On Mon, 29 Aug 2011 17:18:33 -0400 Barry Warsaw wrote: > On Aug 24, 2011, at 01:57 AM, Antoine Pitrou wrote: > > >> One guiding principle for me is that we should keep the abstraction as thin > >> as possible. In particular, I'm concerned about mapping multiple errnos > >> into a single Error. For example both EPIPE and ESHUTDOWN mapping to > >> BrokePipeError, or EACESS or EPERM to PermissionError. I think we should > >> resist this, so that one errno maps to exactly one Error. Where grouping > >> is desired, Python already has mechanisms to deal with that, > >> e.g. superclasses and multiple inheritance. 
Therefore, I think it would be > >> better to have > >> > >> + FileSystemPermissionError > >> + AccessError (EACCES) > >> + PermissionError (EPERM) > > > >I'm not sure that's a good idea: > > Was it the specific grouping under FileSystemPermissionError that you're > objecting to, or the "keep the abstraction thin" principle? The former. EPERM is generally returned for things which aren't filesystem-related. (although I also think separating EACCES and EPERM is of little value *in practice*) > Let's say we > threw out the idea of FSPE superclass, would you still want to collapse EACCES > and EPERM into PermissionError, or would separate exceptions for each be okay? I have a preference for the former, but am not against the latter. I just think that, given AccessError and PermissionError, most users won't know up front which one they should care about. > It's still pretty easy to catch both in one except clause, and it won't be too > annoying if it's rare. Indeed. > Reading your IRC message (sorry, I was afk) it sounds like you think > FileSystemError can be removed. I like keeping the hierarchy flat. Ok. It can be reintroduced later on. (the main reason why I think it can be removed is that EACCES in itself is often tied to filesystem access rights; so the EACCES exception class would have to be a subclass of FileSystemError, while the EPERM one should not :-)) > >>>> open("foo") > >Traceback (most recent call last): > > File "", line 1, in > >FileNotFoundError: [Errno 2] No such file or directory: 'foo' > > > >(see e.g. http://bugs.python.org/issue12762) > > True, but since you're going to be creating a bunch of new exception classes, > it should be relatively painless to give them a better str. Thanks for > pointing out that bug; I agree with it. Well, the str right now is exactly the same as OSError's. > My question mostly was about raising OSError (as the current PEP states) with > an errno that does *not* map to one of the new exceptions. In that case, I > don't think there's anything you could raise other than exactly OSError, > right? And indeed, that's what the implementation does :) > So, for raising OSError with an errno mapping to one of the subclasses, it > appears to break the "explicit is better than implicit" principle, and I think > it could lead to hard-to-debug or understand code. You'll look at code that > raises OSError, but the exception that gets printed will be one of the > subclasses. I'm afraid that if you don't know that this is happening, you're > going to think you're going crazy. Except that it only happens if you use a recognized errno. For example if you do: >>> OSError(errno.ENOENT, "not found") FileNotFoundError(2, 'not found') Not if you just pass a message (or anything else, actually): >>> OSError("some message") OSError('some message',) But if you pass an explicit errno, then the subclass doesn't appear that surprising, does it? > The other half is, let's say raising FileNotFoundError with the EEXIST errno. > I'm guessing that the __init__'s for the new OSError subclasses will not have > an `errno` attribute, so there's no way you can do that, but the PEP does not > discuss this. Actually, the __new__ and the __init__ are exactly the same as OSError's: >>> e = FileNotFoundError("some message") >>> e.errno >>> e = FileNotFoundError(errno.ENOENT, "some message") >>> e.errno 2 > >Wow, I didn't know ESRCH. > >How would you call the respective exceptions? > >- ChildProcessError for ECHILD? > [...] > > >- ProcessLookupError for ESRCH? > [...] 
> > So in a sense, both are lookup errors, though I think it's going too far to > multiply inherit from LookupError. Maybe ChildWaitError or ChildLookupError > for the former? ProcessLookupError seems good to me. Ok. > >> What if all the errno symbolic names were mapped as attributes on IOError? > >> The only advantage of that would be to eliminate the need to import errno, > >> or for the ugly `e.errno == errno.ENOENT` stuff. That would then be > >> rewritten as `e.errno == IOError.ENOENT`. A mild savings to be sure, but > >> still. > > > >Hmm, I guess that's explorable as an orthogonal idea. > > Cool. How should we capture that? A separate PEP perhaps, or more appropriately (IMHO) a tracker entry, since it's just about enriching the attributes of an existing type. I think it's a bit weird to define a whole lot of constants on a built-in type, though. > Okay, so here's what's still outstanding for me: > > * Should we eliminate FileSystemError? (probably "yes") Ok. > * Should we ensure one errno == one exception? > - i.e. separate EACCES and EPERM > - i.e. separate EPIPE and ESHUTDOWN I think that's unhelpful (or downright confusing: what is, intuitively, the difference between an "AccessError" and a "PermissionError"?) to most users, and users to which it is helpful already know how to access the errno. > * Should the str of the new exception subclasses be improved (e.g. to include > the symbolic name instead of the errno first)? As I said, I think it's orthogonal, but I would +1 on including the symbolic name instead of the integer. > * Is the OSError.__new__() hackery a good idea? I think it is, since it also takes care about Python code raising OSErrors, but YMMV. > * Should the PEP define the signature of the new exceptions (e.g. to prohibit > passing in an incorrect errno to an OSError subclass)? The OSError constructor, pre-PEP, is very laxist, and I took care to keep it like that in the implementation. Apparently it's a feature to help migrating old code. > * Can we add ECHILD and ESRCH, and if so, what names should we use? I think the suggested names are ok. > * Where can we capture the idea of putting the symbolic names on OSError class > attributes, or is it a dumb idea that should be ditched? I think it's a separate task altogether, although I'm in favour of it. > * How long should we wait for other Python implementations to chime in? A couple of weeks? I will soon leave on holiday until the end of September anyway. Regards Antoine. From victor.stinner at haypocalc.com Mon Aug 29 23:57:36 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Mon, 29 Aug 2011 23:57:36 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: Message-ID: <201108292357.36628.victor.stinner@haypocalc.com> Le lundi 29 ao?t 2011 19:35:14, stefan brunthaler a ?crit : > pretty much a year ago I wrote about the optimizations I did for my > PhD thesis that target the Python 3 series interpreters Does it speed up Python? :-) Could you provide numbers (benchmarks)? 
Victor From victor.stinner at haypocalc.com Tue Aug 30 00:20:46 2011 From: victor.stinner at haypocalc.com (Victor Stinner) Date: Tue, 30 Aug 2011 00:20:46 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5BE9D8.5050309@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <4E5B5364.9040100@haypocalc.com> <4E5BE9D8.5050309@v.loewis.de> Message-ID: <201108300020.46140.victor.stinner@haypocalc.com> Le lundi 29 ao?t 2011 21:34:48, vous avez ?crit : > >> Those haven't been ported to the new API, yet. Consider, for example, > >> d9821affc9ee. Before that, I got 253 MB/s on the 4096 units read test; > >> with that change, I get 610 MB/s. The trunk gives me 488 MB/s, so this > >> is a 25% speedup for PEP 393. > > > > If I understand correctly, the performance now highly depend on the used > > characters? A pure ASCII string is faster than a string with characters > > in the ISO-8859-1 charset? > > How did you infer that from above paragraph??? ASCII and Latin-1 are > mostly identical in terms of performance - the ASCII decoder should be > slightly slower than the Latin-1 decoder, since the ASCII decoder needs > to check for errors, whereas the Latin-1 decoder will never be > confronted with errors. I don't compare ASCII and ISO-8859-1 decoders. I was asking if decoding b'abc' from ISO-8859-1 is faster than decoding b'ab\xff' from ISO-8859-1, and if yes: why? Your patch replaces PyUnicode_New(size, 255) ... memcpy(), by PyUnicode_FromUCS1(). I don't understand how it makes Python faster: PyUnicode_FromUCS1() does first scan the input string for the maximum code point. I suppose that the main difference is that the ISO-8859-1 encoded string is stored as the UTF-8 encoded string (shared pointer) if all characters of the string are ASCII characters. In this case, encoding the string to UTF-8 doesn't cost anything, we already have the result. Am I correct? Victor From stefan at brunthaler.net Tue Aug 30 00:23:10 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Mon, 29 Aug 2011 15:23:10 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: <201108292357.36628.victor.stinner@haypocalc.com> References: <201108292357.36628.victor.stinner@haypocalc.com> Message-ID: > Does it speed up Python? :-) Could you provide numbers (benchmarks)? > Yes, it does ;) The maximum overall speedup I achieved was by a factor of 2.42 on my i7-920 for the spectralnorm benchmark of the computer language benchmark game. Others from the same set are: binarytrees: 1.9257 (1.9891) fannkuch: 1.6509 (1.7264) fasta: 1.5446 (1.7161) mandelbrot: 2.0040 (2.1847) nbody: 1.6165 (1.7602) spectralnorm: 2.2538 (2.4176) --- overall: 1.8213 (1.9382) (The first number is the combination of all optimizations, the one in parentheses is with my last optimization [Interpreter Instruction Scheduling] enabled, too.) For a comparative real world benchmark I tested Martin von Loewis' django port (there are not that many meaningful Python 3 real world benchmarks) and got a speedup of 1.3 (without IIS). This is reasonably well, US got a speedup of 1.35 on this benchmark. I just checked that pypy-c-latest on 64 bit reports 1.5 (the pypy-c-jit-latest figures seem to be not working currently or *really* fast...), but I cannot tell directly how that relates to speedups (it just says "less is better" and I did not quickly find an explanation). 
Since I did this benchmark last year, I have spent more time investigating this benchmark and found that I could do better, but I would have to guess as to how much (An interesting aside though: on this benchmark, the executable never grew on more than 5 megs of memory usage, exactly like the vanilla Python 3 interpreter.) hth, --stefan From meadori at gmail.com Tue Aug 30 00:44:54 2011 From: meadori at gmail.com (Meador Inge) Date: Mon, 29 Aug 2011 17:44:54 -0500 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: On Sat, Aug 27, 2011 at 11:58 PM, Terry Reedy wrote: > Dan, I once had the more or less the same opinion/question as you with > regard to ctypes, but I now see at least 3 problems. > > 1) It seems hard to write it correctly. There are currently 47 open ctypes > issues, with 9 being feature requests, leaving 38 behavior-related issues. > Tom Heller has not been able to work on it since the beginning of 2010 and > has formally withdrawn as maintainer. No one else that I know of has taken > his place. I am trying to work through getting these issues resolved. The hard part so far has been getting reviews and commits. The follow patches are awaiting review (the patch for issue 11241 has been accepted, just not applied): 1. http://bugs.python.org/issue9041 2. http://bugs.python.org/issue9651 3. http://bugs.python.org/issue11241 I am more than happy to keep working through these issues, but I need some help getting the patches actually applied since I don't have commit rights. -- # Meador From ncoghlan at gmail.com Tue Aug 30 01:47:10 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 30 Aug 2011 09:47:10 +1000 Subject: [Python-Dev] PEP 3151 from the BDFOP In-Reply-To: <20110829171833.5e0cc40d@resist.wooz.org> References: <20110823170357.3b3ab2fc@resist.wooz.org> <20110824015756.51cdceac@pitrou.net> <20110829171833.5e0cc40d@resist.wooz.org> Message-ID: On Tue, Aug 30, 2011 at 7:18 AM, Barry Warsaw wrote: > Okay, so here's what's still outstanding for me: > > * Should we eliminate FileSystemError? (probably "yes") I've also been persuaded that this isn't a generally meaningful categorisation, so +1 for dropping it. ConnectionError is worth keeping, though. > * Should we ensure one errno == one exception? > ?- i.e. separate EACCES and EPERM > ?- i.e. separate EPIPE and ESHUTDOWN I think the concept of a 1:1 mapping is a complete non-starter, since "OSError" is always going to map to multiple errnos (i.e. everything that hasn't been assigned to a specific subclass). Maintaining the class categorisation down to a certain level for ease of separate handling is worthwhile, but below that point it's better to let people realise that they need to understand the subtleties of the different errno values. > * Should the str of the new exception subclasses be improved (e.g. to include > ?the symbolic name instead of the errno first)? I'd say that's a distinct RFE on the tracker (since it applies regardless of the acceptance or rejection of PEP 3151). Good idea in principle, though. > * Is the OSError.__new__() hackery a good idea? I agree it's a little magical, but I also think the PEP becomes pretty useless without it. 
If OSError.__new__ handles the mapping, then most code (including C code) doesn't need to change - it will raise the new subclasses automatically. If we demand that all exception *raising* code be changed, then exception *catching* code will have a hard time assuming that the new subclasses are going to be raised correctly instead of a top level OSError. To make that transition feasible, I think we *need* to make it as hard as we can (if not impossible) to raise OSError instances with defined errno values that *don't* conform to the new hierarchy so that 3.3+ exception catching code doesn't need to worry about things like ENOENT being raised as OSError instead of FileNotFoundError. Only code that also supports earlier versions should need to resort to inspecting the errno values for the coarse distinctions that the PEP provides via the new class hierarchy. > * Should the PEP define the signature of the new exceptions (e.g. to prohibit > ?passing in an incorrect errno to an OSError subclass)? Unfortunately, I think the variations in errno details across platforms mean that being too restrictive in this space would cause more problems than it solves. So it may be wiser to technically allow people to do silly things like "raise FileNotFoundError(errno.EPIPE)" with the admonition not to actually do that because it is obscure and confusing. "Consenting adults", etc. > * Can we add ECHILD and ESRCH, and if so, what names should we use? +1 for ChildProcessError and ProcessLookupError (as peer exceptions on the tier directly below OSError) > * Where can we capture the idea of putting the symbolic names on OSError class > ?attributes, or is it a dumb idea that should be ditched? "Tracker RFE" for the former and "maybe" for the latter. With this PEP, the need for direct inspection of errno values should be significantly reduced in most code, so importing errno shouldn't be necessary. > * How long should we wait for other Python implementations to chime in? "Until Antoine gets back from his holiday" sounds reasonable to me. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From guido at python.org Tue Aug 30 01:48:11 2011 From: guido at python.org (Guido van Rossum) Date: Mon, 29 Aug 2011 16:48:11 -0700 Subject: [Python-Dev] Ctypes and the stdlib In-Reply-To: References: <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: On Mon, Aug 29, 2011 at 2:39 AM, Stefan Behnel wrote: > Guido van Rossum, 29.08.2011 04:27: >> Hm, the main use that was proposed here for ctypes is to wrap existing >> libraries (not to create nicer APIs, that can be done in pure Python >> on top of this). > > The same applies to Cython, obviously. The main advantage of Cython over > ctypes for this is that the Python-level wrapper code is also compiled into > C, so whenever the need for a thicker wrapper arises in some part of the > API, you don't loose any performance in intermediate layers. Yes, this is a very nice advantage. The only advantage that I can think of for ctypes is that it doesn't require a toolchain -- you can just write the Python code and get going. With Cython you will always have to invoke the Cython compiler. Another advantage may be that it works *today* for PyPy -- I don't know the status of Cython for PyPy. 
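(To illustrate the no-toolchain point, a trivial sketch; nothing gets compiled here, and it assumes find_library can locate the C math library on the platform:

    import ctypes, ctypes.util

    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]
    print(libm.cos(0.0))    # 1.0, straight from the shared library

The equivalent Cython wrapper is arguably nicer, but it needs cython plus a C compiler in the loop.)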
Also, (maybe this was answered before?), how well does Cython deal with #include files (especially those you don't have control over, like the ones typically required to use some lib.so safely on all platforms)? -- --Guido van Rossum (python.org/~guido) From ncoghlan at gmail.com Tue Aug 30 02:00:28 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 30 Aug 2011 10:00:28 +1000 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: <20110829231420.20c3516a@pitrou.net> References: <20110829231420.20c3516a@pitrou.net> Message-ID: On Tue, Aug 30, 2011 at 7:14 AM, Antoine Pitrou wrote: > On Mon, 29 Aug 2011 11:33:14 -0700 > stefan brunthaler wrote: >> * The optimized dispatch routine has a changed instruction format >> (word-sized instead of bytecodes) that allows for regular instruction >> decoding (without the HAS_ARG-check) and inlinind of some objects in >> the instruction format on 64bit architectures. > > Having a word-sized "bytecode" format would probably be acceptable in > itself, so if you want to submit a patch for that, go ahead. Although any such patch should discuss how it compares with Cesare's work on wpython. Personally, I *like* CPython fitting into the "simple-and-portable" niche in the Python interpreter space. Armin Rigo made the judgment years ago that CPython was a poor platform for serious optimisation when he stopped working on Psyco and started PyPy instead, and I think the contrasting fates of PyPy and Unladen Swallow have borne out that opinion. Significantly increasing the complexity of CPython for speed-ups that are dwarfed by those available through PyPy seems like a poor trade-off to me. At a bare minimum, I don't think any significant changes should be made under the "it will be faster" justification until the bulk of the real-world benchmark suite used for speed.pypy.org is available for Python 3. (Wasn't there a GSoC project about that?) Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From guido at python.org Tue Aug 30 02:02:26 2011 From: guido at python.org (Guido van Rossum) Date: Mon, 29 Aug 2011 17:02:26 -0700 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: <4E5C01E4.2050106@canterbury.ac.nz> References: <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> <4E5C01E4.2050106@canterbury.ac.nz> Message-ID: On Mon, Aug 29, 2011 at 2:17 PM, Greg Ewing wrote: > Guido van Rossum wrote: >> >> (Just like Python's own .h files -- >> e.g. the extensive renaming of the Unicode APIs depending on >> narrow/wide build) How does Cython deal with these? > > Pyrex/Cython deal with it by generating C code that includes > the relevant headers, so the C compiler expands all the > macros, interprets the struct declarations, etc. All you > need to do when writing the .pyx file is follow the same > API that you would if you were writing C code to use the > library. Interesting. Then how does Pyrex/Cython typecheck your code at compile time? -- --Guido van Rossum (python.org/~guido) From stefan at brunthaler.net Tue Aug 30 02:25:21 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Mon, 29 Aug 2011 17:25:21 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> Message-ID: > Personally, I *like* CPython fitting into the "simple-and-portable" > niche in the Python interpreter space. 
Armin Rigo made the judgment > years ago that CPython was a poor platform for serious optimisation > when he stopped working on Psyco and started PyPy instead, and I think > the contrasting fates of PyPy and Unladen Swallow have borne out that > opinion. Significantly increasing the complexity of CPython for > speed-ups that are dwarfed by those available through PyPy seems like > a poor trade-off to me. > I agree with the trade-off, but the nice thing is that CPython's interpreter remains simple and portable using my optimizations. All of these optimizations are purely interpretative and the complexity of CPython is not affected much. (For example, I have an inline-cached version of BINARY_ADD that is called INCA_FLOAT_ADD [INCA being my abbreviation for INline CAching]; you don't actually have to look at its source code, since it is generated by my code generator but can by looking at instruction traces immediately tell what's going on.) So, the interpreter remains fully portable and any compatibility issues with C modules should not occur either. > At a bare minimum, I don't think any significant changes should be > made under the "it will be faster" justification until the bulk of the > real-world benchmark suite used for speed.pypy.org is available for > Python 3. (Wasn't there a GSoC project about that?) > Having more tests would surely be helpful, as already said, the most real-world stuff I can do is Martin's django patch (some of the other benchmarks though are from the shootout and I can [and did] run them, too {binarytrees, fannkuch, fasta, mandelbrot, nbody and spectralnorm}. I have also the AI benchmark from Unladden Swallow but no current figures.) Best, --stefan From tjreedy at udel.edu Tue Aug 30 02:28:16 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 29 Aug 2011 20:28:16 -0400 Subject: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork" In-Reply-To: References: <20110823205147.3349eaa8@pitrou.net> <1314131362.3485.36.camel@localhost.localdomain> <20110826175336.3af6be57@pitrou.net> <20110829191608.7916da73@pitrou.net> <1314638573.3551.14.camel@localhost.localdomain> Message-ID: On 8/29/2011 3:41 PM, Nir Aides wrote: > I am not familiar with the python-dev definition for deprecation, but Possible to planned eventual removal > when I used the word in the bug discussion I meant to advertize to > users that they should not mix threading and forking since that mix is > and will remain broken by design; I did not mean removal or crippling > of functionality. This would be a note or warning in the doc. You can suggest what and where to add something on an existing issue or a new one. -- Terry Jan Reedy From solipsis at pitrou.net Tue Aug 30 02:55:10 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 30 Aug 2011 02:55:10 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> Message-ID: <20110830025510.638b41d9@pitrou.net> On Tue, 30 Aug 2011 10:00:28 +1000 Nick Coghlan wrote: > > > > Having a word-sized "bytecode" format would probably be acceptable in > > itself, so if you want to submit a patch for that, go ahead. > > Although any such patch should discuss how it compares with Cesare's > work on wpython. > Personally, I *like* CPython fitting into the "simple-and-portable" > niche in the Python interpreter space. Changing the bytecode width wouldn't make the interpreter more complex. 
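As a toy illustration of what the regular, fixed-width format buys (sketched in Python purely for exposition; the real change would of course live in the C eval loop), every instruction carries an argument slot, so decoding never branches on HAS_ARG:

    import opcode

    def encode(op, arg=0):
        # one fixed-width word per instruction: opcode in the low byte,
        # argument in the remaining bits
        return (arg << 8) | op

    def decode(word):
        return word & 0xff, word >> 8

    words = [encode(opcode.opmap['LOAD_CONST'], 1),
             encode(opcode.opmap['BINARY_ADD']),
             encode(opcode.opmap['RETURN_VALUE'])]
    for w in words:
        op, arg = decode(w)
        print(opcode.opname[op], arg)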
> Armin Rigo made the judgment > years ago that CPython was a poor platform for serious optimisation > when he stopped working on Psyco and started PyPy instead, and I think > the contrasting fates of PyPy and Unladen Swallow have borne out that > opinion. Well, PyPy didn't show any significant achievements before they spent *much* more time on it than the Unladen Swallow guys did. Whether or not a good JIT is possible on top of CPython might remain a largely unanswered question. > Significantly increasing the complexity of CPython for > speed-ups that are dwarfed by those available through PyPy seems like > a poor trade-off to me. Some years ago we were waiting for Unladen Swallow to improve itself and be ported to Python 3. Now it seems we are waiting for PyPy to be ported to Python 3. I'm not sure how "let's just wait" is a good trade-off if someone proposes interesting patches (which, of course, remains to be seen). > At a bare minimum, I don't think any significant changes should be > made under the "it will be faster" justification until the bulk of the > real-world benchmark suite used for speed.pypy.org is available for > Python 3. (Wasn't there a GSoC project about that?) I'm not sure what the bulk is, but have you already taken a look at http://hg.python.org/benchmarks/ ? Regards Antoine. From greg at krypto.org Tue Aug 30 04:38:31 2011 From: greg at krypto.org (Gregory P. Smith) Date: Mon, 29 Aug 2011 19:38:31 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <4E5BF9F7.9020608@v.loewis.de> Message-ID: On Mon, Aug 29, 2011 at 2:05 PM, stefan brunthaler wrote: > > The question really is whether this is an all-or-nothing deal. If you > > could identify smaller parts that can be applied independently, interest > > would be higher. > > > Well, it's not an all-or-nothing deal. In my current architecture, I > can selectively enable most of the optimizations as I see fit. The > only pre-requisite (in my implementation) is that I have two dispatch > loops with a changed instruction format. It is, however, not a > technical necessity, just the way I implemented it. Basically, you can > choose whatever you like best, and I could extract that part. I am > just offering to add all the things that I have done :) > > +1 from me on going forward with your performance improvements. The more you can break them down into individual smaller patch sets the better as they can be reviewed and applied as needed. A prerequisites patch, a patch for the wide opcodes, etc.. For benchmarks given this is python 3, just get as many useful ones running as you can. Some in this thread seemed to give the impression that CPython performance is not something to care about. I disagree. I see CPython being the main implementation of Python used in most places for a long time. Improving its performance merely raises the bar to be met by other implementations if they want to compete. That is a good thing! -gps > > Also, I'd be curious whether your techniques help or hinder a potential > > integration of a JIT generator. > > > This is something I have previously frequently discussed with several > JIT people. IMHO, having my optimizations in-place also helps a JIT > compiler, since it can re-use the information I gathered to generate > more aggressively optimized native machine code right away (the inline > caches can be generated with the type information right away, some > functions could be inlined with the guard statements subsumed, etc.) 
> Another benefit could be that the JIT compiler can spend longer time > on generating code, because the interpreter is already faster (so in > some cases it would probably not make sense to include a > non-optimizing fast and simple JIT compiler). > There are others on the list, who probably can/want to comment on this, > too. > > That aside, I think that while having a JIT is an important goal, I > can very well imagine scenarios where the additional memory > consumption (for the generated native machine code) of a JIT for each > process (I assume that the native machine code caches are not shared) > hinders scalability. I have in fact no data to back this up, but I > think that would be an interesting trade off, say if I have 30% gain > in performance without substantial additional memory requirements on > my existing hardware, compared to higher achievable speedups that > require more machines, though. > > > Regards, > --stefan > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/greg%40krypto.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Tue Aug 30 05:29:59 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 30 Aug 2011 13:29:59 +1000 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <4E5BF9F7.9020608@v.loewis.de> Message-ID: On Tue, Aug 30, 2011 at 12:38 PM, Gregory P. Smith wrote: > Some in this thread seemed to give the impression that CPython performance > is not something to care about. I disagree. I see CPython being the main > implementation of Python used in most places for a long time. Improving its > performance merely raises the bar to be met by other implementations if they > want to compete. That is a good thing! Not the impression I intended to give. I merely want to highlight that we need to be careful that incremental increases in complexity are justified with real, measured performance improvements. PyPy has set the bar on how to do that - people that seriously want to make CPython faster need to focus on getting speed.python.org sorted *first* (so we know where we're starting) and *then* work on trying to improve CPython's numbers relative to that starting point. The PSF has the hardware to run the site, but, unless more has been going in the background than I am aware of, is still lacking trusted volunteers to do the following: 1. Getting codespeed up and running on the PSF hardware 2. Hooking it in to the CPython source control infrastructure 3. Getting a reasonable set of benchmarks running on 3.x (likely starting with the already ported set in Mercurial, but eventually we want the full suite that PyPy uses) 4. Once PyPy, Jython and IronPython offer 3.x compatible versions, start including them as well (alternatively, offer 2.x performance comparisons as well, although that's less interesting from a CPython point of view since it can't be used to guide future CPython optimisation efforts) Anecdotal, non-reproducible performance figures are *not* the way to go about serious optimisation efforts. Using a dedicated machine is vulnerable to architecture-specific idiosyncracies, but ad hoc testing on other systems can still be used as a sanity check. Regards, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? 
Brisbane, Australia From greg.ewing at canterbury.ac.nz Tue Aug 30 07:55:20 2011 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 30 Aug 2011 17:55:20 +1200 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> <4E5C01E4.2050106@canterbury.ac.nz> Message-ID: <4E5C7B48.5080402@canterbury.ac.nz> Guido van Rossum wrote: > On Mon, Aug 29, 2011 at 2:17 PM, Greg Ewing wrote: >>All you >>need to do when writing the .pyx file is follow the same >>API that you would if you were writing C code to use the >>library. > > Interesting. Then how does Pyrex/Cython typecheck your code at compile time? You might be reading more into that statement than I meant. You have to supply Pyrex/Cython versions of the C declarations, either hand-written or generated by a tool. But you write them based on the advertised C API -- you don't have to manually expand macros, work out the low-level layout of structs, or anything like that (as you often have to do when using ctypes). -- Greg From greg.ewing at canterbury.ac.nz Tue Aug 30 07:57:28 2011 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 30 Aug 2011 17:57:28 +1200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> Message-ID: <4E5C7BC8.6010302@canterbury.ac.nz> Nick Coghlan wrote: > Personally, I *like* CPython fitting into the "simple-and-portable" > niche in the Python interpreter space. Me, too! I like that I can read the CPython source and understand what it's doing most of the time. Please don't screw that up by attempting to perform heroic optimisations. -- Greg From martin at v.loewis.de Tue Aug 30 08:20:46 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 30 Aug 2011 08:20:46 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <201108300020.46140.victor.stinner@haypocalc.com> References: <4E553FBC.7080501@v.loewis.de> <4E5B5364.9040100@haypocalc.com> <4E5BE9D8.5050309@v.loewis.de> <201108300020.46140.victor.stinner@haypocalc.com> Message-ID: <4E5C813E.9080301@v.loewis.de> > I don't compare ASCII and ISO-8859-1 decoders. I was asking if decoding b'abc' > from ISO-8859-1 is faster than decoding b'ab\xff' from ISO-8859-1, and if yes: > why? No, that makes no difference. > > Your patch replaces PyUnicode_New(size, 255) ... memcpy(), by > PyUnicode_FromUCS1(). You compared to the wrong revision. PyUnicode_New is already a PEP 393 function, and this version you have been comparing to is indeed faster than the current version. However, it is also incorrect, as it fails to compute the maxchar, and hence fails to detect pure-ASCII strings. See below for the actual diff. It should be obvious why the 393 version is faster: 3.3 currently needs to widen each char (to 16 or 32 bits). Regards, Martin @@ -5569,41 +5569,8 @@ Py_ssize_t size, const char *errors) { - PyUnicodeObject *v; - Py_UNICODE *p; - const char *e, *unrolled_end; - /* Latin-1 is equivalent to the first 256 ordinals in Unicode. */ - if (size == 1) { - Py_UNICODE r = *(unsigned char*)s; - return PyUnicode_FromUnicode(&r, 1); - } - - v = _PyUnicode_New(size); - if (v == NULL) - goto onError; - if (size == 0) - return (PyObject *)v; - p = PyUnicode_AS_UNICODE(v); - e = s + size; - /* Unrolling the copy makes it much faster by reducing the looping - overhead. 
This is similar to what many memcpy() implementations do. */ - unrolled_end = e - 4; - while (s < unrolled_end) { - p[0] = (unsigned char) s[0]; - p[1] = (unsigned char) s[1]; - p[2] = (unsigned char) s[2]; - p[3] = (unsigned char) s[3]; - s += 4; - p += 4; - } - while (s < e) - *p++ = (unsigned char) *s++; - return (PyObject *)v; - - onError: - Py_XDECREF(v); - return NULL; + return PyUnicode_FromUCS1((unsigned char*)s, size); } /* create or adjust a UnicodeEncodeError */ From eliben at gmail.com Tue Aug 30 08:22:31 2011 From: eliben at gmail.com (Eli Bendersky) Date: Tue, 30 Aug 2011 09:22:31 +0300 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: <4E5C7BC8.6010302@canterbury.ac.nz> References: <20110829231420.20c3516a@pitrou.net> <4E5C7BC8.6010302@canterbury.ac.nz> Message-ID: On Tue, Aug 30, 2011 at 08:57, Greg Ewing wrote: > Nick Coghlan wrote: > > Personally, I *like* CPython fitting into the "simple-and-portable" >> niche in the Python interpreter space. >> > > Me, too! I like that I can read the CPython source and > understand what it's doing most of the time. Please don't > screw that up by attempting to perform heroic optimisations. > > -- > Following this argument to the extreme, the bytecode evaluation code of CPython can be simplified quite a bit. Lose 2x performance but gain a lot of readability. Does that sound like a good deal? I don't intend to sound sarcastic, just show that IMHO this argument isn't a good one. I think that even clever optimized code can be properly written and *documented* to make the task of understanding it feasible. Personally, I'd love CPython to be a bit faster and see no reason to give up optimization opportunities for the sake of code readability. Eli -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Tue Aug 30 09:58:42 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 30 Aug 2011 17:58:42 +1000 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <4E5C7BC8.6010302@canterbury.ac.nz> Message-ID: On Tue, Aug 30, 2011 at 4:22 PM, Eli Bendersky wrote: > On Tue, Aug 30, 2011 at 08:57, Greg Ewing > wrote: > Following this argument to the extreme, the bytecode evaluation code of > CPython can be simplified quite a bit. Lose 2x performance but gain a lot of > readability. Does that sound like a good deal? I don't intend to sound > sarcastic, just show that IMHO this argument isn't a good one. I think that > even clever optimized code can be properly written and *documented* to make > the task of understanding it feasible. Personally, I'd love CPython to be a > bit faster and see no reason to give up optimization opportunities for the > sake of code readability. Yeah, it's definitely a trade-off - the point I was trying to make is that there *is* a trade-off being made between complexity and speed. I think the computed-gotos stuff struck a nice balance - the macro-fu involved means that you can still understand what the main eval loop is *doing*, even if you don't know exactly what's hidden behind the target macros. Ditto for the older opcode prediction feature and the peephole optimiser - separation of concerns means that you can understand the overall flow of events without needing to understand every little detail. 
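For anyone who has not looked at that layer, the peephole optimiser's effect is easy to observe from Python itself; an illustrative check (the exact disassembly varies a bit between versions):

    import dis

    def f():
        return 24 * 3600   # seconds per day

    # The constant expression is folded at compile time, so the disassembly
    # shows a single LOAD_CONST 86400 instead of two loads and a multiply.
    dis.dis(f)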
This is where the request to extract individual orthogonal changes and submit separate patches comes from - it makes it clear that the independent changes *can* be separated cleanly, and aren't a giant ball of incomprehensible mud. It's the difference between complex (lots of moving parts, that can each be understood on their own and are then composed into a meaningful whole) and complicated (massive patches that don't work at all if any one component is delayed) Eugene Toder's AST optimiser work that I still hope to get into 3.3 will have to undergo a similar process - the current patch covers a bit too much ground and needs to be broken up into smaller steps before we can seriously consider pushing it into the core. Regards, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From martin at v.loewis.de Tue Aug 30 10:06:26 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 30 Aug 2011 10:06:26 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <20110829225413.689d073c@pitrou.net> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <4E577589.4030809@v.loewis.de> <4E5BF741.50209@v.loewis.de> <20110829225413.689d073c@pitrou.net> Message-ID: <4E5C9A02.6080704@v.loewis.de> > This looks very nice. Is 3.3 a wide build? (how about a narrow build?) It's a wide build. For reference, I also attach 64-bit narrow build results, and 32-bit results (wide, narrow, and PEP 393). Savings are much smaller in narrow builds (larger on 32-bit systems than on 64-bit systems). > (is it with your own port of Django to py3k, or is there an official > branch for it?) It's https://bitbucket.org/loewis/django-3k Regards, Martin -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 3k-32-16.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 3k-32-32.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 3k-64-16.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 393-32.txt URL: From stefan_ml at behnel.de Tue Aug 30 10:15:22 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 30 Aug 2011 10:15:22 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> Message-ID: Nick Coghlan, 30.08.2011 02:00: > On Tue, Aug 30, 2011 at 7:14 AM, Antoine Pitrou wrote: >> On Mon, 29 Aug 2011 11:33:14 -0700 stefan brunthaler wrote: >>> * The optimized dispatch routine has a changed instruction format >>> (word-sized instead of bytecodes) that allows for regular instruction >>> decoding (without the HAS_ARG-check) and inlinind of some objects in >>> the instruction format on 64bit architectures. >> >> Having a word-sized "bytecode" format would probably be acceptable in >> itself, so if you want to submit a patch for that, go ahead. > > Although any such patch should discuss how it compares with Cesare's > work on wpython. > > Personally, I *like* CPython fitting into the "simple-and-portable" > niche in the Python interpreter space. Armin Rigo made the judgment > years ago that CPython was a poor platform for serious optimisation > when he stopped working on Psyco and started PyPy instead, and I think > the contrasting fates of PyPy and Unladen Swallow have borne out that > opinion. 
Significantly increasing the complexity of CPython for > speed-ups that are dwarfed by those available through PyPy seems like > a poor trade-off to me. If Stefan can cut down his changes into smaller feature chunks, thus making their benefit reproducible and verifiable by others, it's well worth reconsidering if even a visible increase of complexity isn't worth the improved performance, one patch at a time. Even if PyPy's performance tops the improvements, it's worth remembering that that's also a very different kind of system than CPython, with different resource requirements and a different level of maturity, compatibility, portability, etc. There are many reasons to continue using CPython, not only in corners, and there are many people who would be happy about a faster CPython. Raising the bar has its virtues. That being said, I also second Nick's reference to wpython. If CPython grows its byte code size anyway (which, as I understand, is one part of the proposed changes), it's worth looking at wpython first, given that it has been around and working for a while. The other proposed changes sound like at least some of them are independent from this one. Stefan From mark at hotpy.org Tue Aug 30 10:31:12 2011 From: mark at hotpy.org (Mark Shannon) Date: Tue, 30 Aug 2011 09:31:12 +0100 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> Message-ID: <4E5C9FD0.2040208@hotpy.org> Nick Coghlan wrote: > On Tue, Aug 30, 2011 at 7:14 AM, Antoine Pitrou wrote: >> On Mon, 29 Aug 2011 11:33:14 -0700 >> stefan brunthaler wrote: >>> * The optimized dispatch routine has a changed instruction format >>> (word-sized instead of bytecodes) that allows for regular instruction >>> decoding (without the HAS_ARG-check) and inlinind of some objects in >>> the instruction format on 64bit architectures. >> Having a word-sized "bytecode" format would probably be acceptable in >> itself, so if you want to submit a patch for that, go ahead. > > Although any such patch should discuss how it compares with Cesare's > work on wpython. > > Personally, I *like* CPython fitting into the "simple-and-portable" > niche in the Python interpreter space. CPython has a a large number of micro-optimisations, scattered all of the code base. By removing these and adding large-scale optimisations, like Stephan's, the code base *might* actually get smaller overall (and thus simpler) *and* faster. Of course, CPython must remain portable. [snip] > > At a bare minimum, I don't think any significant changes should be > made under the "it will be faster" justification until the bulk of the > real-world benchmark suite used for speed.pypy.org is available for > Python 3. (Wasn't there a GSoC project about that?) +1 Cheers, Mark. From mark at hotpy.org Tue Aug 30 10:32:02 2011 From: mark at hotpy.org (Mark Shannon) Date: Tue, 30 Aug 2011 09:32:02 +0100 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: <4E5BF9F7.9020608@v.loewis.de> References: <4E5BF9F7.9020608@v.loewis.de> Message-ID: <4E5CA002.7010109@hotpy.org> Martin v. L?wis wrote: >> So, the two big issues aside, is there any interest in incorporating >> these optimizations in Python 3? > > The question really is whether this is an all-or-nothing deal. If you > could identify smaller parts that can be applied independently, interest > would be higher. > > Also, I'd be curious whether your techniques help or hinder a potential > integration of a JIT generator. 
A JIT compiler is not a silver bullet, translation to machine code is just one of many optimisations performed by PyPy. A compiler merely removes interpretative overhead, at the cost of significantly increased code size, whereas Stephan's work attacks both interpreter overhead and some of the inefficiencies due to dynamic typing. If Unladen Swallow achieved anything it was to demonstrate that a JIT alone does not work well. My (experimental) HotPy VM has similar base-line speed to CPython, yet is able to outperform Unladen Swallow using interpreter-only optimisations. (It goes even faster with the compiler turned on :) ) Cheers, Mark. > > Regards, > Martin > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/mark%40hotpy.org From martin at v.loewis.de Tue Aug 30 10:40:16 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 30 Aug 2011 10:40:16 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: <20110830025510.638b41d9@pitrou.net> References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> Message-ID: <4E5CA1F0.2070005@v.loewis.de> >> Although any such patch should discuss how it compares with Cesare's >> work on wpython. >> Personally, I *like* CPython fitting into the "simple-and-portable" >> niche in the Python interpreter space. > > Changing the bytecode width wouldn't make the interpreter more complex. No, but I think Stefan is proposing to add a *second* byte code format, in addition to the one that remains there. That would certainly be an increase in complexity. > Some years ago we were waiting for Unladen Swallow to improve itself > and be ported to Python 3. Now it seems we are waiting for PyPy to be > ported to Python 3. I'm not sure how "let's just wait" is a good > trade-off if someone proposes interesting patches (which, of course, > remains to be seen). I completely agree. Let's not put unmet preconditions to such projects. For example, I still plan to write a JIT for Python at some point. This may happen in two months, or in two years. I wouldn't try to stop anybody from contributing improvements that may become obsolete with the JIT. The only recent case where I *did* try to stop people is with PEP-393, where I do believe that some of the changes that had been made over the last year become redundant. Regards, Martin From martin at v.loewis.de Tue Aug 30 10:46:22 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 30 Aug 2011 10:46:22 +0200 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: <4E5C7B48.5080402@canterbury.ac.nz> References: <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> <4E5C01E4.2050106@canterbury.ac.nz> <4E5C7B48.5080402@canterbury.ac.nz> Message-ID: <4E5CA35E.8000509@v.loewis.de> > You might be reading more into that statement than I meant. > You have to supply Pyrex/Cython versions of the C declarations, > either hand-written or generated by a tool. But you write them > based on the advertised C API -- you don't have to manually > expand macros, work out the low-level layout of structs, or > anything like that (as you often have to do when using ctypes). I can understand how that works when building a CPython extension. 
But what about creating Jython/IronPython modules with Cython? At what point get the header files considered there? Regards, Martin From stefan_ml at behnel.de Tue Aug 30 10:57:22 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 30 Aug 2011 10:57:22 +0200 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: <4E5CA35E.8000509@v.loewis.de> References: <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> <4E5C01E4.2050106@canterbury.ac.nz> <4E5C7B48.5080402@canterbury.ac.nz> <4E5CA35E.8000509@v.loewis.de> Message-ID: "Martin v. L?wis", 30.08.2011 10:46: >> You might be reading more into that statement than I meant. >> You have to supply Pyrex/Cython versions of the C declarations, >> either hand-written or generated by a tool. But you write them >> based on the advertised C API -- you don't have to manually >> expand macros, work out the low-level layout of structs, or >> anything like that (as you often have to do when using ctypes). > > I can understand how that works when building a CPython extension. > But what about creating Jython/IronPython modules with Cython? > At what point get the header files considered there? I had written a bit about this here: http://thread.gmane.org/gmane.comp.python.devel/126340/focus=126419 Stefan From ncoghlan at gmail.com Tue Aug 30 12:55:51 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 30 Aug 2011 20:55:51 +1000 Subject: [Python-Dev] Planned PEP status changes In-Reply-To: References: Message-ID: On Sat, Aug 27, 2011 at 2:35 AM, Brett Cannon wrote: > On Tue, Aug 23, 2011 at 19:42, Nick Coghlan wrote: >> Unless I hear any objections, I plan to adjust the current PEP >> statuses as follows some time this weekend: >> >> Move from Accepted to Finished: >> >> ? ?389 ?argparse - New Command Line Parsing Module ? ? ? ? ? ? ?Bethard >> ? ?391 ?Dictionary-Based Configuration For Logging ? ? ? ? ? ? ?Sajip >> ? ?3108 ?Standard Library Reorganization ? ? ? ? ? ? ? ? ? ? ? ? Cannon > > I had always hoped to get profile/cProfile taken care of, but > obviously that just didn't ever happen. So no objection, just a slight > sting from the reminder of why the PEP was left open. After starting to write a justification for marking the PEP as Final despite the outstanding TODO items, I realised that didn't make a lot of sense, so I left it at Accepted instead. So your call if you want to say "not gonna happen" and close it out anyway. I made the other 4 changes though (argparse, logging.dictConfig, new super -> Final, Unladen Swallow -> Withdrawn). Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From solipsis at pitrou.net Tue Aug 30 13:33:23 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 30 Aug 2011 13:33:23 +0200 Subject: [Python-Dev] PEP 393 review In-Reply-To: <4E5C9A02.6080704@v.loewis.de> References: <4E553FBC.7080501@v.loewis.de> <20110824203228.3e00874d@pitrou.net> <4E5606C7.9000404@v.loewis.de> <4E577589.4030809@v.loewis.de> <4E5BF741.50209@v.loewis.de> <20110829225413.689d073c@pitrou.net> <4E5C9A02.6080704@v.loewis.de> Message-ID: <20110830133323.13842072@pitrou.net> By the way, I don't know if you're working on it, but StringIO seems a bit broken right now. test_memoryio crashes here: test_newline_cr (test.test_memoryio.CStringIOTest) ... 
Fatal Python error: Segmentation fault Current thread 0x00007f3f6353b700: File "/home/antoine/cpython/pep-393/Lib/test/test_memoryio.py", line 583 in test_newline_cr File "/home/antoine/cpython/pep-393/Lib/unittest/case.py", line 386 in _executeTestPart File "/home/antoine/cpython/pep-393/Lib/unittest/case.py", line 441 in run File "/home/antoine/cpython/pep-393/Lib/unittest/case.py", line 493 in __call__ File "/home/antoine/cpython/pep-393/Lib/unittest/suite.py", line 105 in run File "/home/antoine/cpython/pep-393/Lib/unittest/suite.py", line 67 in __call__ File "/home/antoine/cpython/pep-393/Lib/unittest/suite.py", line 105 in run File "/home/antoine/cpython/pep-393/Lib/unittest/suite.py", line 67 in __call__ File "/home/antoine/cpython/pep-393/Lib/unittest/runner.py", line 168 in run File "/home/antoine/cpython/pep-393/Lib/test/support.py", line 1293 in _run_suite File "/home/antoine/cpython/pep-393/Lib/test/support.py", line 1327 in run_unittest File "/home/antoine/cpython/pep-393/Lib/test/test_memoryio.py", line 718 in test_main File "/home/antoine/cpython/pep-393/Lib/test/regrtest.py", line 1139 in runtest_inner File "/home/antoine/cpython/pep-393/Lib/test/regrtest.py", line 915 in runtest File "/home/antoine/cpython/pep-393/Lib/test/regrtest.py", line 707 in main File "/home/antoine/cpython/pep-393/Lib/test/__main__.py", line 13 in File "/home/antoine/cpython/pep-393/Lib/runpy.py", line 73 in _run_code File "/home/antoine/cpython/pep-393/Lib/runpy.py", line 160 in _run_module_as_main Erreur de segmentation (core dumped) And here's an excerpt of the C stack: #0 find_control_char (translated=0, universal=0, readnl=, kind=4, start=0xa75cf4 "c", end= 0xa75d00 "", consumed=0x7fffffffab38) at ./Modules/_io/textio.c:1617 #1 _PyIO_find_line_ending (translated=0, universal=0, readnl=, kind=4, start=0xa75cf4 "c", end= 0xa75d00 "", consumed=0x7fffffffab38) at ./Modules/_io/textio.c:1678 #2 0x00000000004ed3be in _stringio_readline (self=0x7ffff291a250) at ./Modules/_io/stringio.c:271 #3 stringio_iternext (self=0x7ffff291a250) at ./Modules/_io/stringio.c:322 #4 0x000000000052aa19 in listextend (self=0x7ffff2900ab8, b=) at Objects/listobject.c:844 #5 0x000000000052afe8 in list_init (self=0x7ffff2900ab8, args=, kw=) at Objects/listobject.c:2312 #6 0x00000000004283c7 in type_call (type=, args=(<_io.StringIO at remote 0x7ffff291a250>,), kwds=0x0) at Objects/typeobject.c:692 #7 0x00000000004fdf17 in PyObject_Call (func=, arg=, kw=) at Objects/abstract.c:2147 Regards Antoine. From solipsis at pitrou.net Tue Aug 30 13:38:29 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 30 Aug 2011 13:38:29 +0200 Subject: [Python-Dev] Python 3 optimizations continued... References: <4E5BF9F7.9020608@v.loewis.de> Message-ID: <20110830133829.099d7714@pitrou.net> On Tue, 30 Aug 2011 13:29:59 +1000 Nick Coghlan wrote: > > Anecdotal, non-reproducible performance figures are *not* the way to > go about serious optimisation efforts. What about anecdotal *and* reproducible performance figures? :) I may be half-joking, but we already have a set of py3k-compatible benchmarks and, besides, sometimes a timeit invocation gives a good idea of whether an approach is fruitful or not. While a permanent public reference with historical tracking of performance figures is even better, let's not freeze everything until it's ready. (for example, do we need to wait for speed.python.org before PEP 393 is accepted?) Regards Antoine. 
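The kind of quick check meant here is nothing fancier than timing one micro-operation on a patched and an unpatched build; an illustrative sketch (the workload is a placeholder and no actual numbers are claimed):

    import timeit

    # best of three runs of 100000 iterations; run the same snippet on both
    # interpreters and compare the minima
    t = timeit.Timer("s.encode('latin-1')", setup="s = 'abc' * 1000")
    print(min(t.repeat(repeat=3, number=100000)))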
From ncoghlan at gmail.com Tue Aug 30 15:05:06 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 30 Aug 2011 23:05:06 +1000 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: <20110830133829.099d7714@pitrou.net> References: <4E5BF9F7.9020608@v.loewis.de> <20110830133829.099d7714@pitrou.net> Message-ID: On Tue, Aug 30, 2011 at 9:38 PM, Antoine Pitrou wrote: > On Tue, 30 Aug 2011 13:29:59 +1000 > Nick Coghlan wrote: >> >> Anecdotal, non-reproducible performance figures are *not* the way to >> go about serious optimisation efforts. > > What about anecdotal *and* reproducible performance figures? :) > I may be half-joking, but we already have a set of py3k-compatible > benchmarks and, besides, sometimes a timeit invocation gives a good > idea of whether an approach is fruitful or not. > While a permanent public reference with historical tracking of > performance figures is even better, let's not freeze everything until > it's ready. > (for example, do we need to wait for speed.python.org before PEP 393 is > accepted?) Yeah, I'd neglected the idea of just running perf.py for pre- and post-patch performance comparisons. You're right that that can generate sufficient info to make a well-informed decision. I'd still really like it if some of the people advocating that we care about CPython performance actually volunteered to spearhead the effort to get speed.python.org up and running, though. As far as I know, the hardware's spinning idly waiting to be given work to do :P Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From vinay_sajip at yahoo.co.uk Tue Aug 30 15:09:19 2011 From: vinay_sajip at yahoo.co.uk (Vinay Sajip) Date: Tue, 30 Aug 2011 13:09:19 +0000 (UTC) Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) References: <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: Meador Inge gmail.com> writes: > 1. http://bugs.python.org/issue9041 I raised a question about this patch (in the issue tracker). > 2. http://bugs.python.org/issue9651 > 3. http://bugs.python.org/issue11241 I presume, since Amaury has commit rights, that he could commit these. Regards, Vinay Sajip From riscutiavlad at gmail.com Tue Aug 30 16:27:08 2011 From: riscutiavlad at gmail.com (Vlad Riscutia) Date: Tue, 30 Aug 2011 07:27:08 -0700 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5909FD.7060809@v.loewis.de> <20110827174057.6c4b619e@pitrou.net> <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> Message-ID: I also have some patches sitting on the tracker for some time: http://bugs.python.org/issue12764 http://bugs.python.org/issue11835 http://bugs.python.org/issue12528 which also fixes http://bugs.python.org/issue6069 and http://bugs.python.org/issue11920 http://bugs.python.org/issue6068 which also fixes http://bugs.python.org/issue6493 Thank you, Vlad On Tue, Aug 30, 2011 at 6:09 AM, Vinay Sajip wrote: > Meador Inge gmail.com> writes: > > > > 1. http://bugs.python.org/issue9041 > > I raised a question about this patch (in the issue tracker). > > > 2. http://bugs.python.org/issue9651 > > 3. http://bugs.python.org/issue11241 > > I presume, since Amaury has commit rights, that he could commit these. 
> > Regards, > > Vinay Sajip > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/riscutiavlad%40gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Tue Aug 30 17:20:16 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 30 Aug 2011 17:20:16 +0200 Subject: [Python-Dev] cpython: Remove display options (--name, etc.) from the Distribution class. References: Message-ID: <20110830172016.01999c5f@pitrou.net> On Tue, 30 Aug 2011 16:22:14 +0200 eric.araujo wrote: > http://hg.python.org/cpython/rev/af0bcccb935b > changeset: 72127:af0bcccb935b > user: ?ric Araujo > date: Tue Aug 30 00:55:02 2011 +0200 > summary: > Remove display options (--name, etc.) from the Distribution class. > > These options were used to implement ?setup.py --name?, > ?setup.py --version?, etc. which are now handled by the pysetup metadata > action or direct parsing of the setup.cfg file. > > As a side effect, the Distribution class no longer accepts a 'url' key > in its *attrs* argument: it has to be 'home-page' to be recognized as a > valid metadata field and passed down to the dist.metadata object. I don't want to sound nitpicky, but it's the first time I see "home-page" hyphenized. How about "homepage"? Regards Antoine. From stefan at brunthaler.net Tue Aug 30 17:27:13 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Tue, 30 Aug 2011 08:27:13 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: <4E5CA1F0.2070005@v.loewis.de> References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> Message-ID: >> Changing the bytecode width wouldn't make the interpreter more complex. > > No, but I think Stefan is proposing to add a *second* byte code format, > in addition to the one that remains there. That would certainly be an > increase in complexity. > Yes, indeed I have a more straightforward instruction format to allow for more efficient decoding. Just going from bytecode size to word-code size without changing the instruction format is going to require 8 (or word-size) times more memory on a 64bit system. From an optimization perspective, the irregular instruction format was the biggest problem, because checking for HAS_ARG is always on the fast path and mostly unpredictable. Hence, I chose to extend the instruction format to have word-size and use the additional space to have the upper half be used for the argument and the lower half for the actual opcode. Encoding is more efficient, and *not* more complex. Using profiling to indicate what code is hot, I don't waste too much memory on encoding this regular instruction format. > For example, I still plan to write a JIT for Python at some point. This > may happen in two months, or in two years. I wouldn't try to stop > anybody from contributing improvements that may become obsolete with the > JIT. > I would not necessary argue that at least my optimizations would become obsolete; if you still think about writing a JIT, it might make sense to re-use what I've got and not start from scratch, e.g., building a simple JIT compiler that just inlines the operation implementations as templates to eliminate the interpretative overhead (in similar vein as Piumarta and Riccardi's paper from 1998) might be good start. 
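To sketch the profiling-plus-quickening control flow in isolation (a toy Python model, purely illustrative; the actual implementation is generated C code and the names below are made up to echo the INCA naming above):

    HOT_THRESHOLD = 1000

    def generic_add(left, right):
        return left + right

    def inca_float_add(left, right):
        # specialised handler: guard on the expected types, fall back otherwise
        if type(left) is float and type(right) is float:
            return left + right
        return generic_add(left, right)

    class Instruction:
        def __init__(self):
            self.handler = generic_add
            self.counter = 0

        def execute(self, left, right):
            self.counter += 1
            if (self.handler is generic_add and self.counter > HOT_THRESHOLD
                    and type(left) is float and type(right) is float):
                # quicken in place once the site is hot and the operand
                # types look stable
                self.handler = inca_float_add
            return self.handler(left, right)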
Thoug I don't want to pre-influence your JIT design, I'm just thinking out loud... Regards, --stefan From phd at phdru.name Tue Aug 30 17:34:58 2011 From: phd at phdru.name (Oleg Broytman) Date: Tue, 30 Aug 2011 19:34:58 +0400 Subject: [Python-Dev] PyPI went down In-Reply-To: <20110830153001.GA13312@iskra.aviel.ru> References: <20110830153001.GA13312@iskra.aviel.ru> Message-ID: <20110830153458.GB13312@iskra.aviel.ru> On Tue, Aug 30, 2011 at 07:30:01PM +0400, Oleg Broytman wrote: > PyPI went down More information: ports 80 and 443 are open, the servers performs SSL handshake but timeouts on HTTP requests (with or without SSL). Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From phd at phdru.name Tue Aug 30 17:40:46 2011 From: phd at phdru.name (Oleg Broytman) Date: Tue, 30 Aug 2011 19:40:46 +0400 Subject: [Python-Dev] PyPI went down In-Reply-To: <20110830153458.GB13312@iskra.aviel.ru> References: <20110830153001.GA13312@iskra.aviel.ru> <20110830153458.GB13312@iskra.aviel.ru> Message-ID: <20110830154046.GC13312@iskra.aviel.ru> It is back up. I am very sorry for the fuss. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From merwok at netwok.org Tue Aug 30 17:34:02 2011 From: merwok at netwok.org (=?UTF-8?B?w4lyaWMgQXJhdWpv?=) Date: Tue, 30 Aug 2011 17:34:02 +0200 Subject: [Python-Dev] cpython: Remove display options (--name, etc.) from the Distribution class. In-Reply-To: <20110830172016.01999c5f@pitrou.net> References: <20110830172016.01999c5f@pitrou.net> Message-ID: <4E5D02EA.2070800@netwok.org> Hi, Le 30/08/2011 17:20, Antoine Pitrou a ?crit : > On Tue, 30 Aug 2011 16:22:14 +0200 > eric.araujo wrote: >> As a side effect, the Distribution class no longer accepts a 'url' key >> in its *attrs* argument: it has to be 'home-page' to be recognized as a >> valid metadata field and passed down to the dist.metadata object. > > I don't want to sound nitpicky, but it's the first time I see > "home-page" hyphenized. How about "homepage"? This value is defined in the accepted Metadata PEPs, which use home-page. Regards From phd at phdru.name Tue Aug 30 17:30:01 2011 From: phd at phdru.name (Oleg Broytman) Date: Tue, 30 Aug 2011 19:30:01 +0400 Subject: [Python-Dev] PyPI went down Message-ID: <20110830153001.GA13312@iskra.aviel.ru> Hello! I released the first package of two and PyPI went down while I was preparing to release the second. I hope it wasn't me? Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From guido at python.org Tue Aug 30 18:42:09 2011 From: guido at python.org (Guido van Rossum) Date: Tue, 30 Aug 2011 09:42:09 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> Message-ID: Stefan, have you shared a pointer to your code yet? Is it open source? It sounds like people are definitely interested and it would make sense to let them experiment with your code and review it. 
-- --Guido van Rossum (python.org/~guido) From martin at v.loewis.de Tue Aug 30 18:46:53 2011 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 30 Aug 2011 18:46:53 +0200 Subject: [Python-Dev] PyPI went down In-Reply-To: <20110830153001.GA13312@iskra.aviel.ru> References: <20110830153001.GA13312@iskra.aviel.ru> Message-ID: <4E5D13FD.1050107@v.loewis.de> > I released the first package of two and PyPI went down while I was > preparing to release the second. I hope it wasn't me? A few minutes ago, it was responding very slowly, and I found out that Postgres consumes all time. I haven't put energy into investigating what was causing this - apparently, somebody was throwing odd queries at it. Restarting Apache reduced the load. If they continue to do so, I investigate further. Regards, Martin From martin at v.loewis.de Tue Aug 30 18:49:15 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Tue, 30 Aug 2011 18:49:15 +0200 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> <4E5C01E4.2050106@canterbury.ac.nz> <4E5C7B48.5080402@canterbury.ac.nz> <4E5CA35E.8000509@v.loewis.de> Message-ID: <4E5D148B.1060606@v.loewis.de> >> I can understand how that works when building a CPython extension. >> But what about creating Jython/IronPython modules with Cython? >> At what point get the header files considered there? > > I had written a bit about this here: > > http://thread.gmane.org/gmane.comp.python.devel/126340/focus=126419 I see. So there is potential for error there. Regards, Martin From thomas at python.org Tue Aug 30 18:55:47 2011 From: thomas at python.org (Thomas Wouters) Date: Tue, 30 Aug 2011 18:55:47 +0200 Subject: [Python-Dev] PyPI went down In-Reply-To: <4E5D13FD.1050107@v.loewis.de> References: <20110830153001.GA13312@iskra.aviel.ru> <4E5D13FD.1050107@v.loewis.de> Message-ID: On Tue, Aug 30, 2011 at 18:46, "Martin v. L?wis" wrote: > > I released the first package of two and PyPI went down while I was > > preparing to release the second. I hope it wasn't me? > > A few minutes ago, it was responding very slowly, and I found out that > Postgres consumes all time. I haven't put energy into investigating what > was causing this - apparently, somebody was throwing odd queries at it. > Restarting Apache reduced the load. If they continue to do so, I > investigate further. > Looks like the issue keeps popping up. It was slow to respond earlier today, and I keep getting complaints about it (including now.) -- Thomas Wouters Hi! I'm a .signature virus! copy me into your .signature file to help me spread! -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Tue Aug 30 19:05:12 2011 From: guido at python.org (Guido van Rossum) Date: Tue, 30 Aug 2011 10:05:12 -0700 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: <4E5D148B.1060606@v.loewis.de> References: <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> <4E5C01E4.2050106@canterbury.ac.nz> <4E5C7B48.5080402@canterbury.ac.nz> <4E5CA35E.8000509@v.loewis.de> <4E5D148B.1060606@v.loewis.de> Message-ID: On Tue, Aug 30, 2011 at 9:49 AM, "Martin v. L?wis" wrote: >>> I can understand how that works when building a CPython extension. 
>>> But what about creating Jython/IronPython modules with Cython? >>> At what point get the header files considered there? >> >> I had written a bit about this here: >> >> http://thread.gmane.org/gmane.comp.python.devel/126340/focus=126419 > > I see. So there is potential for error there. To elaborate, with CPython it looks pretty solid, at least for functions and constants (does it do structs?). You must manually declare the name and signature of a function, and Pyrex/Cython emits C code that includes the header and calls the function with the appropriate types. If the signature you declare doesn't match what's in the .h file you'll get a compiler error when the C code is compiled. If (perhaps on some platforms) the function is really a macro, the macro in the .h file will be invoked and the right thing will happen. So far so good. The problem lies with the PyPy backend -- there it generates ctypes code, which means that the signature you declare to Cython/Pyrex must match the *linker* level API, not the C compiler level API. Thus, if in a system header a certain function is really a macro that invokes another function with a permuted or augmented argument list, you'd have to know what that macro does. I also don't see how this would work for #defined constants: where does Cython/Pyrex get their value? ctypes doesn't have their values. So, for PyPy, a solution based on Cython/Pyrex has many of the same downsides as one based on ctypes where it comes to complying with an API defined by a .h file. -- --Guido van Rossum (python.org/~guido) From stephen at xemacs.org Tue Aug 30 19:22:25 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 31 Aug 2011 02:22:25 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <20110829141440.2e2178c6@pitrou.net> References: <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> Message-ID: <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> Antoine Pitrou writes: > On Mon, 29 Aug 2011 12:43:24 +0900 > "Stephen J. Turnbull" wrote: > > > > Since when can s[0] represent a code point outside the BMP, for s a > > Unicode string in a narrow build? > > > > Remember, the UCS-2/narrow vs. UCS-4/wide distinction is *not* about > > what Python supports vs. the outside world. It's about what the str/ > > unicode type is an array of. > > Why would that be? Because what the outside world sees is produced by codecs, not by str. The outside world can't see whether you have narrow or wide unless it uses indexing ... ie, experiments to determine what the str type is an array of. The problem with a narrow build (whether for space efficiency in CPython or for platform compatibility in Jython and IronPython) is not that we have no UTF-16 codecs. It's that array ops aren't UTF-16 conformant. 
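Concretely, and purely as an illustration of "conformant" in that sense: on a narrow build, len() and indexing count UTF-16 code units, so an operation that wants to hand back real code points has to recognise and join surrogate pairs itself, along these lines:

    def iter_code_points(s):
        # yields code points even when astral characters are stored as
        # surrogate pairs (as on a narrow build); on a wide build the loop
        # simply never sees a matching pair
        i, n = 0, len(s)
        while i < n:
            unit = ord(s[i])
            if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                    and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
                yield 0x10000 + ((unit - 0xD800) << 10) + (ord(s[i + 1]) - 0xDC00)
                i += 2
            else:
                yield unit
                i += 1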
From martin at v.loewis.de Tue Aug 30 19:17:35 2011 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Tue, 30 Aug 2011 19:17:35 +0200 Subject: [Python-Dev] PyPI went down In-Reply-To: References: <20110830153001.GA13312@iskra.aviel.ru> <4E5D13FD.1050107@v.loewis.de> Message-ID: <4E5D1B2F.5060402@v.loewis.de> > Looks like the issue keeps popping up. It was slow to respond earlier > today, and I keep getting complaints about it (including now.) Somebody is mirroring the site with wget. I have null-routed them. Regards, Martin From solipsis at pitrou.net Tue Aug 30 19:19:46 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 30 Aug 2011 19:19:46 +0200 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <1314724786.3554.1.camel@localhost.localdomain> > The problem with a narrow build (whether for space efficiency in > CPython or for platform compatibility in Jython and IronPython) is not > that we have no UTF-16 codecs. It's that array ops aren't UTF-16 > conformant. Sorry, what is a conformant UTF-16 array op? Thanks Antoine. From stefan at brunthaler.net Tue Aug 30 19:23:34 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Tue, 30 Aug 2011 10:23:34 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> Message-ID: On Tue, Aug 30, 2011 at 09:42, Guido van Rossum wrote: > Stefan, have you shared a pointer to your code yet? Is it open source? > I have no shared code repository, but could create one (is there any pydev preferred provider?). I have all the copyrights on the code, and I would like to open-source it. > It sounds like people are definitely interested and it would make > sense to let them experiment with your code and review it. > That sounds fine. I need to do some clean up work (contains most of my comments to remind me of issues) and currently does not pass all regression tests. But if people want to take a look first to decide if they want it than that's good enough for me. (I just wanted to know if there is substantial interest so that it eventually pays off to find and fix the remaining bugs) --stefan From solipsis at pitrou.net Tue Aug 30 19:38:06 2011 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 30 Aug 2011 19:38:06 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> Message-ID: <20110830193806.0d718a56@pitrou.net> On Tue, 30 Aug 2011 08:27:13 -0700 stefan brunthaler wrote: > >> Changing the bytecode width wouldn't make the interpreter more complex. > > > > No, but I think Stefan is proposing to add a *second* byte code format, > > in addition to the one that remains there. That would certainly be an > > increase in complexity. 
> > > Yes, indeed I have a more straightforward instruction format to allow > for more efficient decoding. Just going from bytecode size to > word-code size without changing the instruction format is going to > require 8 (or word-size) times more memory on a 64bit system. Do you really need it to match a machine word? Or is, say, a 16-bit format sufficient. Regards Antoine. From stefan at brunthaler.net Tue Aug 30 19:50:01 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Tue, 30 Aug 2011 10:50:01 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: <20110830193806.0d718a56@pitrou.net> References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: > Do you really need it to match a machine word? Or is, say, a 16-bit > format sufficient. > Hm, technically no, but practically it makes more sense, as (at least for x86 architectures) having opargs and opcodes in half-words can be efficiently expressed in assembly. On 64bit architectures, I could also inline data object references that fit into the 32bit upper half. It turns out that most constant objects fit nicely into this, and I have used this for a special cache region (again below 2^32) for global objects, too. So, technically it's not necessary, but practically it makes a lot of sense. (Most of these things work on 32bit systems, too. For architectures with a smaller size, we can adapt or disable the optimizations.) Cheers, --stefan From guido at python.org Tue Aug 30 20:12:13 2011 From: guido at python.org (Guido van Rossum) Date: Tue, 30 Aug 2011 11:12:13 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: On Tue, Aug 30, 2011 at 10:50 AM, stefan brunthaler wrote: >> Do you really need it to match a machine word? Or is, say, a 16-bit >> format sufficient. >> > Hm, technically no, but practically it makes more sense, as (at least > for x86 architectures) having opargs and opcodes in half-words can be > efficiently expressed in assembly. On 64bit architectures, I could > also inline data object references that fit into the 32bit upper half. > It turns out that most constant objects fit nicely into this, and I > have used this for a special cache region (again below 2^32) for > global objects, too. So, technically it's not necessary, but > practically it makes a lot of sense. (Most of these things work on > 32bit systems, too. For architectures with a smaller size, we can > adapt or disable the optimizations.) Do I sense that the bytecode format is no longer platform-independent? That will need a bit of discussion. I bet there are some things around that depend on that. -- --Guido van Rossum (python.org/~guido) From tjreedy at udel.edu Tue Aug 30 20:15:32 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 30 Aug 2011 14:15:32 -0400 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> <4E5C01E4.2050106@canterbury.ac.nz> <4E5C7B48.5080402@canterbury.ac.nz> <4E5CA35E.8000509@v.loewis.de> <4E5D148B.1060606@v.loewis.de> Message-ID: On 8/30/2011 1:05 PM, Guido van Rossum wrote: >> I see. So there is potential for error there. 
> > To elaborate, with CPython it looks pretty solid, at least for > functions and constants (does it do structs?). You must manually > declare the name and signature of a function, and Pyrex/Cython emits C > code that includes the header and calls the function with the > appropriate types. If the signature you declare doesn't match what's > in the .h file you'll get a compiler error when the C code is > compiled. If (perhaps on some platforms) the function is really a > macro, the macro in the .h file will be invoked and the right thing > will happen. So far so good. > > The problem lies with the PyPy backend -- there it generates ctypes > code, which means that the signature you declare to Cython/Pyrex must > match the *linker* level API, not the C compiler level API. Thus, if > in a system header a certain function is really a macro that invokes > another function with a permuted or augmented argument list, you'd > have to know what that macro does. I also don't see how this would > work for #defined constants: where does Cython/Pyrex get their value? > ctypes doesn't have their values. > > So, for PyPy, a solution based on Cython/Pyrex has many of the same > downsides as one based on ctypes where it comes to complying with an > API defined by a .h file. Thank you for this elaboration. My earlier comment that ctypes seems to be hard to use was based on observation of posts to python-list presenting failed attempts (which have included somehow getting function signatures wrong) and a sense that ctypes was somehow bypassing the public compiler API to make a more direct access via some private api. You have explained and named that as the 'linker API', so I understand much better now. Nothing like 'linker API' or 'signature' appears in the ctypes doc. All I could find about discovering specific function calling conventions is "To find out the correct calling convention you have to look into the C header file or the documentation for the function you want to call." Perhaps that should be elaborated to explain, as you did above, the need to trace macro definitions to find the actual calling convention and the need to be aware that macro definitions can change to accommodate implementation detail changes even as the surface calling conventions seems to remain the same. -- Terry Jan Reedy From stefan at brunthaler.net Tue Aug 30 20:23:56 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Tue, 30 Aug 2011 11:23:56 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: > Do I sense that the bytecode format is no longer platform-independent? > That will need a bit of discussion. I bet there are some things around > that depend on that. > Hm, I haven't really thought about that in detail and for longer, I ran it on PowerPC 970 and Intel Atom & i7 without problems (the latter ones are a non-issue) and think that it can be portable. I just stuff argument and opcode into one word for regular instruction decoding like a RISC CPU, and I realize there might be little/big endian issues, but they surely can be conditionally compiled... --stefan From guido at python.org Tue Aug 30 20:27:56 2011 From: guido at python.org (Guido van Rossum) Date: Tue, 30 Aug 2011 11:27:56 -0700 Subject: [Python-Dev] Python 3 optimizations continued... 
In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: On Tue, Aug 30, 2011 at 11:23 AM, stefan brunthaler wrote: >> Do I sense that the bytecode format is no longer platform-independent? >> That will need a bit of discussion. I bet there are some things around >> that depend on that. >> > Hm, I haven't really thought about that in detail and for longer, I > ran it on PowerPC 970 and Intel Atom & i7 without problems (the latter > ones are a non-issue) and think that it can be portable. I just stuff > argument and opcode into one word for regular instruction decoding > like a RISC CPU, and I realize there might be little/big endian > issues, but they surely can be conditionally compiled... Um, I'm sorry, but that reply sounds incredibly naive, like you're not really sure what the on-disk format for .pyc files is or why it would matter. You're not even answering the question, except indirectly -- it seems that you've never even thought about the possibility of generating a .pyc file on one platform and copying it to a computer using a different one. -- --Guido van Rossum (python.org/~guido) From stefan at brunthaler.net Tue Aug 30 20:34:08 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Tue, 30 Aug 2011 11:34:08 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: > Um, I'm sorry, but that reply sounds incredibly naive, like you're not > really sure what the on-disk format for .pyc files is or why it would > matter. You're not even answering the question, except indirectly -- > it seems that you've never even thought about the possibility of > generating a .pyc file on one platform and copying it to a computer > using a different one. > Well, it may sound incredibly naive, but the truth is: I am never storing the optimized representation to disk, it's done purely at runtime when profiling tells me it makes sense to make the switch. Thus I circumvent many of the problems outlined by you. So I am positive that a full fledged change of the representation has many more intricacies to it, but my approach is only tangentially related... --stefan From tjreedy at udel.edu Tue Aug 30 20:41:05 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 30 Aug 2011 14:41:05 -0400 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> Message-ID: On 8/30/2011 1:23 PM, stefan brunthaler wrote: > (I just wanted to know if there is substantial interest so that > it eventually pays off to find and fix the remaining bugs) It is the nature of our development process that there usually can be no guarantee of acceptance of future code. The rather early acceptance of Unladen Swallow was to me something of an anomaly. I also think it was something of a mistake insofar as it discouraged other efforts, like yours. I think the answer you have gotten is that there is a) substantial interest and b) a willingness to consider a major change such as switfing from bytecode to something else. 
There also seem to be two main concerns: 1) that the increase in complexity be 'less' than the increase in speed, and 2) that the changes be presented in small enough chunks that they can be reviewed. Whether this is good enough for you to proceed is for you to decide. -- Terry Jan Reedy From guido at python.org Tue Aug 30 20:43:35 2011 From: guido at python.org (Guido van Rossum) Date: Tue, 30 Aug 2011 11:43:35 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: On Tue, Aug 30, 2011 at 11:34 AM, stefan brunthaler wrote: >> Um, I'm sorry, but that reply sounds incredibly naive, like you're not >> really sure what the on-disk format for .pyc files is or why it would >> matter. You're not even answering the question, except indirectly -- >> it seems that you've never even thought about the possibility of >> generating a .pyc file on one platform and copying it to a computer >> using a different one. > Well, it may sound incredibly naive, but the truth is: I am never > storing the optimized representation to disk, it's done purely at > runtime when profiling tells me it makes sense to make the switch. > Thus I circumvent many of the problems outlined by you. So I am > positive that a full fledged change of the representation has many > more intricacies to it, but my approach is only tangentially > related... Ok, there there's something else you haven't told us. Are you saying that the original (old) bytecode is still used (and hence written to and read from .pyc files)? -- --Guido van Rossum (python.org/~guido) From g.brandl at gmx.net Tue Aug 30 22:01:29 2011 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 30 Aug 2011 22:01:29 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: Am 30.08.2011 20:34, schrieb stefan brunthaler: >> Um, I'm sorry, but that reply sounds incredibly naive, like you're not >> really sure what the on-disk format for .pyc files is or why it would >> matter. You're not even answering the question, except indirectly -- >> it seems that you've never even thought about the possibility of >> generating a .pyc file on one platform and copying it to a computer >> using a different one. >> > Well, it may sound incredibly naive, but the truth is: I am never > storing the optimized representation to disk, it's done purely at > runtime when profiling tells me it makes sense to make the switch. > Thus I circumvent many of the problems outlined by you. So I am > positive that a full fledged change of the representation has many > more intricacies to it, but my approach is only tangentially > related... You know, instead of all these half-explanations, giving us access to the code would shut us up much more effectively. 
Don't worry about not passing tests, this is what the official trunk does half of the time ;) Georg From stefan_ml at behnel.de Tue Aug 30 22:03:12 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 30 Aug 2011 22:03:12 +0200 Subject: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3) In-Reply-To: References: <4E5951D5.5020200@v.loewis.de> <20110828002642.4765fc89@pitrou.net> <20110828012705.523e51d4@pitrou.net> <4E5C01E4.2050106@canterbury.ac.nz> <4E5C7B48.5080402@canterbury.ac.nz> <4E5CA35E.8000509@v.loewis.de> <4E5D148B.1060606@v.loewis.de> Message-ID: Guido van Rossum, 30.08.2011 19:05: > On Tue, Aug 30, 2011 at 9:49 AM, "Martin v. L?wis" wrote: >>>> I can understand how that works when building a CPython extension. >>>> But what about creating Jython/IronPython modules with Cython? >>>> At what point get the header files considered there? >>> >>> I had written a bit about this here: >>> >>> http://thread.gmane.org/gmane.comp.python.devel/126340/focus=126419 >> >> I see. So there is potential for error there. > > To elaborate, with CPython it looks pretty solid, at least for > functions and constants (does it do structs?). Sure. They even coerce from Python dicts and accept keyword arguments in Cython. > You must manually > declare the name and signature of a function, and Pyrex/Cython emits C > code that includes the header and calls the function with the > appropriate types. If the signature you declare doesn't match what's > in the .h file you'll get a compiler error when the C code is > compiled. If (perhaps on some platforms) the function is really a > macro, the macro in the .h file will be invoked and the right thing > will happen. So far so good. Right. > The problem lies with the PyPy backend -- there it generates ctypes > code, which means that the signature you declare to Cython/Pyrex must > match the *linker* level API, not the C compiler level API. Thus, if > in a system header a certain function is really a macro that invokes > another function with a permuted or augmented argument list, you'd > have to know what that macro does. I also don't see how this would > work for #defined constants: where does Cython/Pyrex get their value? > ctypes doesn't have their values. > > So, for PyPy, a solution based on Cython/Pyrex has many of the same > downsides as one based on ctypes where it comes to complying with an > API defined by a .h file. Right again. The declarations that Cython uses describe the API at the C or C++ level. They do not describe the ABI. So the situation is the same as with ctypes, and the same solutions (or work-arounds) apply, such as generating additional glue code that calls macros or reads compile time constants, for example. That's the approach that the IronPython backend has taken. It's a lot more complex, but also a lot more versatile in the long run. Stefan From arigo at tunes.org Tue Aug 30 22:02:09 2011 From: arigo at tunes.org (Armin Rigo) Date: Tue, 30 Aug 2011 22:02:09 +0200 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: References: Message-ID: Re-hi, 2011/8/29 Armin Rigo : >> The problem is that many locks are actually acquired implicitely. >> For example, `print` to a buffered stream will acquire the fileobject's mutex. > > Indeed. > (...) > I suspect that I need to do a more thorough review of the stdlib (...) I found a solution not involving any change in CPython, and updated the patch. 
The solution is to say that a "with atomic" block doesn't completely prevent other threads from re-acquiring the GIL, but only prevents them from proceeding to the following bytecode. So if another thread is currently suspended in a place that releases the GIL for other reasons, then this other thread can still be switched to as normal, and continue running until the end of the current bytecode. I think it's sane enough for the original purpose, and avoids most deadlock cases. A bient?t, Armin. From stefan at brunthaler.net Tue Aug 30 22:41:01 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Tue, 30 Aug 2011 13:41:01 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: > Ok, there there's something else you haven't told us. Are you saying > that the original (old) bytecode is still used (and hence written to > and read from .pyc files)? > Short answer: yes. Long answer: I added an invocation counter to the code object and keep interpreting in the usual Python interpreter until this counter reaches a configurable threshold. When it reaches this threshold, I create the new instruction format and interpret with this optimized representation. All the macros look exactly the same in the source code, they are just redefined to use the different instruction format. I am at no point serializing this representation or the runtime information gathered by me, as any subsequent invocation might have different characteristics. I will remove my development commentaries and create a private repository at bitbucket for you* to take an early look like Georg (and more or less Terry, too) suggested. Is that a good way for most of you? (I would then give access to whomever wants to take a look.) Best, --stefan *: not personally targeted at Guido (who is naturally very much welcome to take a look, too) but addressed to python-dev in general. From benjamin at python.org Tue Aug 30 22:42:54 2011 From: benjamin at python.org (Benjamin Peterson) Date: Tue, 30 Aug 2011 16:42:54 -0400 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: 2011/8/30 stefan brunthaler : > I will remove my development commentaries and create a private > repository at bitbucket for you* to take an early look like Georg (and > more or less Terry, too) suggested. Is that a good way for most of > you? (I would then give access to whomever wants to take a look.) And what is wrong with a public one? -- Regards, Benjamin From stefan at brunthaler.net Tue Aug 30 22:51:44 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Tue, 30 Aug 2011 13:51:44 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: On Tue, Aug 30, 2011 at 13:42, Benjamin Peterson wrote: > 2011/8/30 stefan brunthaler : >> I will remove my development commentaries and create a private >> repository at bitbucket for you* to take an early look like Georg (and >> more or less Terry, too) suggested. Is that a good way for most of >> you? (I would then give access to whomever wants to take a look.) 
> > And what is wrong with a public one? > Well, since it does not fully pass all regression tests and is just meant for people to take a first look to find out if it's interesting, I think I might take it offline after you had a look. It seems to me that that is easier to be done with a private repository, but in general, I don't have a problem with a public one... Regards, --stefan PS: If you want to, I can also just put a tarball on my home page and post a link here. It's not that I would like to have control/influence about who is allowed to look and who doesn't. From benjamin at python.org Tue Aug 30 22:54:07 2011 From: benjamin at python.org (Benjamin Peterson) Date: Tue, 30 Aug 2011 16:54:07 -0400 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: 2011/8/30 stefan brunthaler : > On Tue, Aug 30, 2011 at 13:42, Benjamin Peterson wrote: >> 2011/8/30 stefan brunthaler : >>> I will remove my development commentaries and create a private >>> repository at bitbucket for you* to take an early look like Georg (and >>> more or less Terry, too) suggested. Is that a good way for most of >>> you? (I would then give access to whomever wants to take a look.) >> >> And what is wrong with a public one? >> > Well, since it does not fully pass all regression tests and is just > meant for people to take a first look to find out if it's interesting, > I think I might take it offline after you had a look. It seems to me > that that is easier to be done with a private repository, but in > general, I don't have a problem with a public one... Well, if your intention is for people to look at it, public seems to be the best solution. -- Regards, Benjamin From yselivanov.ml at gmail.com Tue Aug 30 23:33:53 2011 From: yselivanov.ml at gmail.com (Yury Selivanov) Date: Tue, 30 Aug 2011 17:33:53 -0400 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: References: Message-ID: <544C8633-8847-4018-875C-2FD093CCD885@gmail.com> Maybe it'd be better to put 'atomic' in the threading module? On 2011-08-30, at 4:02 PM, Armin Rigo wrote: > Re-hi, > > 2011/8/29 Armin Rigo : >>> The problem is that many locks are actually acquired implicitely. >>> For example, `print` to a buffered stream will acquire the fileobject's mutex. >> >> Indeed. >> (...) >> I suspect that I need to do a more thorough review of the stdlib (...) > > I found a solution not involving any change in CPython, and updated > the patch. The solution is to say that a "with atomic" block doesn't > completely prevent other threads from re-acquiring the GIL, but only > prevents them from proceeding to the following bytecode. So if > another thread is currently suspended in a place that releases the GIL > for other reasons, then this other thread can still be switched to as > normal, and continue running until the end of the current bytecode. I > think it's sane enough for the original purpose, and avoids most > deadlock cases. > > > A bient?t, > > Armin. > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/yselivanov.ml%40gmail.com From greg at krypto.org Tue Aug 30 23:47:28 2011 From: greg at krypto.org (Gregory P. 
Smith) Date: Tue, 30 Aug 2011 14:47:28 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID:

On Tue, Aug 30, 2011 at 1:54 PM, Benjamin Peterson wrote:
> 2011/8/30 stefan brunthaler :
> > On Tue, Aug 30, 2011 at 13:42, Benjamin Peterson > wrote:
> >> 2011/8/30 stefan brunthaler :
> >>> I will remove my development commentaries and create a private
> >>> repository at bitbucket for you* to take an early look like Georg (and
> >>> more or less Terry, too) suggested. Is that a good way for most of
> >>> you? (I would then give access to whomever wants to take a look.)
> >>
> >> And what is wrong with a public one?
> >>
> > Well, since it does not fully pass all regression tests and is just
> > meant for people to take a first look to find out if it's interesting,
> > I think I might take it offline after you had a look. It seems to me
> > that that is easier to be done with a private repository, but in
> > general, I don't have a problem with a public one...
>
> Well, if your intention is for people to look at it, public seems to
> be the best solution.

+1

The point of open source is more eyeballs and the ability for anyone else to pick up code and run in whatever direction they want (license permitting) with it. :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jacek.pliszka at gmail.com Wed Aug 31 00:35:37 2011
From: jacek.pliszka at gmail.com (Jacek Pliszka)
Date: Wed, 31 Aug 2011 00:35:37 +0200
Subject: [Python-Dev] Coding guidelines for os.walk filter
Message-ID:

Hi!

I would like to get some opinion on a possible os.walk improvement. For the sake of simplicity, let's assume I would like to skip all .svn and tmp directories. The current solution looks like this:

    for t in os.walk(somedir):
        t[1][:] = set(t[1]) - {'.svn', 'tmp'}   # prune the dirs list in place
        ... do something

This is a very clever hack but... it relies on the internal implementation of os.walk....

An alternative is adding an os.walk parameter, e.g. like this:

    def walk(top, topdown=True, onerror=None, followlinks=False, walkfilter=None):
        ....
        if walkfilter is not None:
            dirs, nondirs = walkfilter(top, dirs, nondirs)
        .....

and remove .svn and tmp in the walkfilter definition. What I do not like here is that followlinks becomes redundant - it is easily implementable through walkfilter.

A simpler option, but one breaking backward compatibility, would be:

    def walk(top, topdown=True, onerror=None, skipdirs=islink):
        ...
    -            if followlinks or not islink(new_path):
    -                for x in walk(new_path, topdown, onerror, followlinks):
    +            if not skipdirs(new_path):
    +                for x in walk(new_path, topdown, onerror, skipdirs):

The user-given skipdirs function should then return true for a new_path ending in .svn or tmp. Nothing is redundant, and it works fine with topdown=False!

What do you think? Shall we:
a) do nothing and use the implicit hack
b) make the option explicit, keeping backward compatibility but with redundancy and a topdown=False incompatibility
c) make the option explicit, breaking backward compatibility but with no redundancy

Best Regards,

Jacek Pliszka

From jnoller at gmail.com Wed Aug 31 01:21:18 2011
From: jnoller at gmail.com (Jesse Noller)
Date: Tue, 30 Aug 2011 19:21:18 -0400
Subject: [Python-Dev] Python 3 optimizations continued...
In-Reply-To: References: <4E5BF9F7.9020608@v.loewis.de> <20110830133829.099d7714@pitrou.net> Message-ID: On Aug 30, 2011, at 9:05 AM, Nick Coghlan wrote: > On Tue, Aug 30, 2011 at 9:38 PM, Antoine Pitrou wrote: >> On Tue, 30 Aug 2011 13:29:59 +1000 >> Nick Coghlan wrote: >>> >>> Anecdotal, non-reproducible performance figures are *not* the way to >>> go about serious optimisation efforts. >> >> What about anecdotal *and* reproducible performance figures? :) >> I may be half-joking, but we already have a set of py3k-compatible >> benchmarks and, besides, sometimes a timeit invocation gives a good >> idea of whether an approach is fruitful or not. >> While a permanent public reference with historical tracking of >> performance figures is even better, let's not freeze everything until >> it's ready. >> (for example, do we need to wait for speed.python.org before PEP 393 is >> accepted?) > > Yeah, I'd neglected the idea of just running perf.py for pre- and > post-patch performance comparisons. You're right that that can > generate sufficient info to make a well-informed decision. > > I'd still really like it if some of the people advocating that we care > about CPython performance actually volunteered to spearhead the effort > to get speed.python.org up and running, though. As far as I know, the > hardware's spinning idly waiting to be given work to do :P > > Cheers, > Nick. > Discussion of speed.python.org should happen on the mailing list for that project if possible. From ncoghlan at gmail.com Wed Aug 31 01:26:53 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 31 Aug 2011 09:26:53 +1000 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> Message-ID: On Wed, Aug 31, 2011 at 3:23 AM, stefan brunthaler wrote: > On Tue, Aug 30, 2011 at 09:42, Guido van Rossum wrote: >> Stefan, have you shared a pointer to your code yet? Is it open source? >> > I have no shared code repository, but could create one (is there any > pydev preferred provider?). I have all the copyrights on the code, and > I would like to open-source it. Currently, the easiest way to create shared repositories for CPython variants is to start with bitbucket's mirror of the main CPython repo: https://bitbucket.org/mirror/cpython/overview Use the website to create your own public CPython fork, then edit the configuration of your local copy of the CPython repo to point to the your new bitbucket repo rather than the main one on hg.python.org. hg push/pull can then be used as normal to publish in-development material to the world. 'hg pull' from hg.python.org makes it fairly easy to track the trunk. One key thing is to avoid making any changes of your own on the official CPython branches (i.e. default, 3.2, 2.7). Instead, use a named branch for anything you're working on. This makes it much easier to generate standalone patches later on. My own public sandbox (https://bitbucket.org/ncoghlan/cpython_sandbox/overview) is set up that way, and you can see plenty of other examples on bitbucket. Cheers, Nick. -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From ncoghlan at gmail.com Wed Aug 31 01:30:24 2011 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 31 Aug 2011 09:30:24 +1000 Subject: [Python-Dev] Python 3 optimizations continued... 
In-Reply-To: References: <4E5BF9F7.9020608@v.loewis.de> <20110830133829.099d7714@pitrou.net> Message-ID: On Wed, Aug 31, 2011 at 9:21 AM, Jesse Noller wrote: > Discussion of speed.python.org should happen on the mailing list for that project if possible. Hah, that's how out of the loop I am on that front - I didn't even know there *was* a mailing list for it :) Subscribed! Cheers, Nick. P.S. For anyone else that is interested: http://mail.python.org/mailman/listinfo/speed -- Nick Coghlan?? |?? ncoghlan at gmail.com?? |?? Brisbane, Australia From murman at gmail.com Wed Aug 31 04:21:52 2011 From: murman at gmail.com (Michael Urman) Date: Tue, 30 Aug 2011 21:21:52 -0500 Subject: [Python-Dev] Coding guidelines for os.walk filter In-Reply-To: References: Message-ID: > for t in os.walk(somedir): > ? ?t[1][:]=set(t[1])-{'.svn','tmp'} > ? ?... do something > > This is a very clever hack but... it relies on internal implementation > of os.walk.... This doesn't appear to be an internal implementation detail; this is documented behavior. http://docs.python.org/dev/library/os.html#os.walk shows a similar example: for root, dirs, files in os.walk('python/Lib/email'): # ... dirs.remove('CVS') # don't visit CVS directories -- Michael Urman From stephen at xemacs.org Wed Aug 31 04:55:09 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 31 Aug 2011 11:55:09 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <1314724786.3554.1.camel@localhost.localdomain> References: <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> Message-ID: <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> Antoine Pitrou writes: > Sorry, what is a conformant UTF-16 array op? For starters, one that doesn't ever return lone surrogates, but rather interprets surrogate pairs as Unicode code points as in UTF-16. (This is not a Unicode standard definition, it's intended to be suggestive of why many app writers will be distressed if they must use Python unicode/str in a narrow build without a fairly comprehensive library that wraps the arrays in operations that treat unicode/str as an array of code points.) From tjreedy at udel.edu Wed Aug 31 05:22:56 2011 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 30 Aug 2011 23:22:56 -0400 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: On 8/30/2011 2:12 PM, Guido van Rossum wrote: > On Tue, Aug 30, 2011 at 10:50 AM, stefan brunthaler > wrote: >>> Do you really need it to match a machine word? Or is, say, a 16-bit >>> format sufficient. >>> >> Hm, technically no, but practically it makes more sense, as (at least >> for x86 architectures) having opargs and opcodes in half-words can be >> efficiently expressed in assembly. On 64bit architectures, I could >> also inline data object references that fit into the 32bit upper half. 
>> It turns out that most constant objects fit nicely into this, and I >> have used this for a special cache region (again below 2^32) for >> global objects, too. So, technically it's not necessary, but >> practically it makes a lot of sense. (Most of these things work on >> 32bit systems, too. For architectures with a smaller size, we can >> adapt or disable the optimizations.) > > Do I sense that the bytecode format is no longer platform-independent? > That will need a bit of discussion. I bet there are some things around > that depend on that. I find myself more comfortable with the Cesare Di Mauro's idea of expanding to 16 bits as the code unit. His basic idea was using 2, 4, or 6 bytes instead of 1, 3, or 6. It actually tended to save space because many ops with small ints (which are very common) contract from 3 bytes to 2 bytes or from 9(?) (two instructions) to 6. I am sorry he was not able to followup on the initial promising results. The dis output was probably easier to read than the current output. Perhaps he made a mistake in combining the above idea with a shift from stack to hybrid stack+register design. -- Terry Jan Reedy From guido at python.org Wed Aug 31 05:35:27 2011 From: guido at python.org (Guido van Rossum) Date: Tue, 30 Aug 2011 20:35:27 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull wrote: > Antoine Pitrou writes: > > ?> Sorry, what is a conformant UTF-16 array op? > > For starters, one that doesn't ever return lone surrogates, but rather > interprets surrogate pairs as Unicode code points as in UTF-16. ?(This > is not a Unicode standard definition, it's intended to be suggestive > of why many app writers will be distressed if they must use Python > unicode/str in a narrow build without a fairly comprehensive library > that wraps the arrays in operations that treat unicode/str as an array > of code points.) That sounds like a contradiction -- it wouldn't be a UTF-16 array if you couldn't tell that it was using UTF-16. -- --Guido van Rossum (python.org/~guido) From cesare.di.mauro at gmail.com Wed Aug 31 07:04:35 2011 From: cesare.di.mauro at gmail.com (Cesare Di Mauro) Date: Wed, 31 Aug 2011 07:04:35 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: <20110830025510.638b41d9@pitrou.net> References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> Message-ID: 2011/8/30 Antoine Pitrou > Changing the bytecode width wouldn't make the interpreter more complex. It depends on the kind of changes. :) WPython introduced a very different "intermediate code" representation that required a big change on the peepholer optimizer on 1.0 alpha version. 
On 1.1 final I decided to completely move that code on ast.c (mostly for constant-folding) and compiler.c (for the usual peepholer usage: seeking for some "patterns" to substitute with better ones) because I found it simpler and more convenient. In the end, taking out some new optimizations that I've implemented "on the road", the interpreter code is a bit more complex. > > Some years ago we were waiting for Unladen Swallow to improve itself > and be ported to Python 3. Now it seems we are waiting for PyPy to be > ported to Python 3. I'm not sure how "let's just wait" is a good > trade-off if someone proposes interesting patches (which, of course, > remains to be seen). > > Regards > > Antoine. > > It isn't, because motivation to do something new with CPython vanishes, at least on some areas (virtual machine / ceval.c), even having some ideas to experiment with. That's why in my last talk on EuroPython I decided to move on other areas (Python objects). Regards Cesare -------------- next part -------------- An HTML attachment was scrubbed... URL: From cesare.di.mauro at gmail.com Wed Aug 31 07:10:40 2011 From: cesare.di.mauro at gmail.com (Cesare Di Mauro) Date: Wed, 31 Aug 2011 07:10:40 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <4E5C7BC8.6010302@canterbury.ac.nz> Message-ID: 2011/8/30 Nick Coghlan > > Yeah, it's definitely a trade-off - the point I was trying to make is > that there *is* a trade-off being made between complexity and speed. > > I think the computed-gotos stuff struck a nice balance - the macro-fu > involved means that you can still understand what the main eval loop > is *doing*, even if you don't know exactly what's hidden behind the > target macros. Ditto for the older opcode prediction feature and the > peephole optimiser - separation of concerns means that you can > understand the overall flow of events without needing to understand > every little detail. > > This is where the request to extract individual orthogonal changes and > submit separate patches comes from - it makes it clear that the > independent changes *can* be separated cleanly, and aren't a giant > ball of incomprehensible mud. It's the difference between complex > (lots of moving parts, that can each be understood on their own and > are then composed into a meaningful whole) and complicated (massive > patches that don't work at all if any one component is delayed) > > Eugene Toder's AST optimiser work that I still hope to get into 3.3 > will have to undergo a similar process - the current patch covers a > bit too much ground and needs to be broken up into smaller steps > before we can seriously consider pushing it into the core. > > Regards, > Nick. > > Sometimes it cannot be done, because big changes produces big patches as well. I don't see a problem here if the code is well written (as "required" buy the Python community :) and the developer is available to talk about his work to clear some doubts. Regards Cesare -------------- next part -------------- An HTML attachment was scrubbed... URL: From cesare.di.mauro at gmail.com Wed Aug 31 07:14:08 2011 From: cesare.di.mauro at gmail.com (Cesare Di Mauro) Date: Wed, 31 Aug 2011 07:14:08 +0200 Subject: [Python-Dev] Python 3 optimizations continued... 
In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> Message-ID: 2011/8/30 stefan brunthaler > Yes, indeed I have a more straightforward instruction format to allow > for more efficient decoding. Just going from bytecode size to > word-code size without changing the instruction format is going to > require 8 (or word-size) times more memory on a 64bit system. From an > optimization perspective, the irregular instruction format was the > biggest problem, because checking for HAS_ARG is always on the fast > path and mostly unpredictable. Hence, I chose to extend the > instruction format to have word-size and use the additional space to > have the upper half be used for the argument and the lower half for > the actual opcode. Encoding is more efficient, and *not* more complex. > Using profiling to indicate what code is hot, I don't waste too much > memory on encoding this regular instruction format. > > Regards, > --stefan > That seems exactly the WPython approach, albeit I used the new "wordcode" in place of the old bytecode. Take a look at it. ;) Regards Cesare -------------- next part -------------- An HTML attachment was scrubbed... URL: From cesare.di.mauro at gmail.com Wed Aug 31 07:16:49 2011 From: cesare.di.mauro at gmail.com (Cesare Di Mauro) Date: Wed, 31 Aug 2011 07:16:49 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: 2011/8/30 stefan brunthaler > > Do I sense that the bytecode format is no longer platform-independent? > > That will need a bit of discussion. I bet there are some things around > > that depend on that. > > > Hm, I haven't really thought about that in detail and for longer, I > ran it on PowerPC 970 and Intel Atom & i7 without problems (the latter > ones are a non-issue) and think that it can be portable. I just stuff > argument and opcode into one word for regular instruction decoding > like a RISC CPU, and I realize there might be little/big endian > issues, but they surely can be conditionally compiled... > > --stefan > I think that you must deal with big endianess because some RISC can't handle at all data in little endian format. In WPython I have wrote some macros which handle both endianess, but lacking big endian machines I never had the opportunity to verify if something was wrong. Regards Cesare -------------- next part -------------- An HTML attachment was scrubbed... URL: From cesare.di.mauro at gmail.com Wed Aug 31 07:38:44 2011 From: cesare.di.mauro at gmail.com (Cesare Di Mauro) Date: Wed, 31 Aug 2011 07:38:44 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: 2011/8/31 Terry Reedy > I find myself more comfortable with the Cesare Di Mauro's idea of expanding > to 16 bits as the code unit. His basic idea was using 2, 4, or 6 bytes > instead of 1, 3, or 6. > It can be expanded to longer than 6 bytes opcodes, if needed. The format is designed to be flexible enough to accommodate such changes without pains. > It actually tended to save space because many ops with small ints (which > are very common) contract from 3 bytes to 2 bytes or from 9(?) (two > instructions) to 6. 
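(As a rough illustration of the 16-bit code unit idea being discussed, here is a hypothetical layout, not WPython's or CPython's actual encoding: an opcode and a small argument share one unit, so decoding becomes a fixed shift-and-mask instead of a HAS_ARG test.)

    def pack(opcode, oparg=0):
        # Pack an 8-bit opcode and an 8-bit argument into one 16-bit unit.
        assert 0 <= opcode <= 0xFF and 0 <= oparg <= 0xFF
        return opcode | (oparg << 8)

    def unpack(unit):
        # Constant-time decode: low byte is the opcode, high byte the argument.
        return unit & 0xFF, unit >> 8

    unit = pack(100, 3)              # e.g. a made-up LOAD_CONST 3
    assert unpack(unit) == (100, 3)

Arguments wider than a byte would presumably spill into a following unit, much as EXTENDED_ARG works today, which is where the longer 4- and 6-byte forms come in.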
> It can pack up to 4 (old) opcodes into one wordcode (superinstruction). Wordcodes are designed to favor instruction "grouping". > I am sorry he was not able to followup on the initial promising results. > In a few words: lack of interest. Why spending (so much) time to a project when you see that the community is oriented towards other directions (Unladen Swallow at first, PyPy in the last period, given the substantial drop of the former)? Also, Guido seems to dislike what he finds as "hacks", and never showed interest. In WPython 1.1 I "rolled back" the "hack" that I introduced in PyObject types (a couple of fields) in 1.0 alpha, to make the code more "polished" (but with a sensible drop in the performance). But again, I saw no interest on WPython, so I decided to put a stop at it, and blocking my initial idea to go for Python 3. > The dis output was probably easier to read than the current output. > > Perhaps he made a mistake in combining the above idea with a shift from > stack to hybrid stack+register design. > > -- > Terry Jan Reedy > > As I already said, wordcodes are designed to favor "grouping". So It was quite natural to became an "hybrid" VM. Anyway, both space and performance gained from this wordcodes "property". ;) Regards Cesare -------------- next part -------------- An HTML attachment was scrubbed... URL: From stephen at xemacs.org Wed Aug 31 08:03:52 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 31 Aug 2011 15:03:52 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> Guido van Rossum writes: > On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull wrote: > > For starters, one that doesn't ever return lone surrogates, but rather > > interprets surrogate pairs as Unicode code points as in UTF-16. ?(This > > is not a Unicode standard definition, it's intended to be suggestive > > of why many app writers will be distressed if they must use Python > > unicode/str in a narrow build without a fairly comprehensive library > > that wraps the arrays in operations that treat unicode/str as an array > > of code points.) > > That sounds like a contradiction -- it wouldn't be a UTF-16 array if > you couldn't tell that it was using UTF-16. Well, that's why I wrote "intended to be suggestive". The Unicode Standard does not specify at all what the internal representation of characters may be, it only specifies what their external behavior must be when two processes communicate. (For "process" as used in the standard, think "Python modules" here, since we are concerned with the problems of folks who develop in Python.) When observing the behavior of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or even UTF-32 arrays; only arrays of characters. 
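(A concrete sketch of what a conformant code-point view over UTF-16 code units involves: well-formed surrogate pairs are combined, lone surrogates are refused. This is purely illustrative and not part of any proposal.)

    def iter_code_points(units):
        # 'units' is a sequence of 16-bit UTF-16 code units given as ints.
        it = iter(units)
        for u in it:
            if 0xD800 <= u <= 0xDBFF:                # high surrogate: needs a mate
                low = next(it, None)
                if low is None or not 0xDC00 <= low <= 0xDFFF:
                    raise ValueError('ill-formed UTF-16: lone surrogate')
                yield 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
            elif 0xDC00 <= u <= 0xDFFF:              # stray low surrogate
                raise ValueError('ill-formed UTF-16: lone surrogate')
            else:
                yield u

    assert list(iter_code_points([0x0041, 0xD834, 0xDD1E])) == [0x41, 0x1D11E]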
Thus, according to the rules of handling a UTF-16 stream, it is an error to observe a lone surrogate or a surrogate pair that isn't a high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and C8-C10). That's what I mean by "can't tell it's UTF-16". And I understand those requirements to mean that operations on UTF-16 streams should produce UTF-16 streams, or raise an error. Without that closure property for basic operations on str, I think it's a bad idea to say that the representation of text in a str in a pre-PEP-393 "narrow" build is UTF-16. For many users and app developers, it creates expectations that are not fulfilled. It's true that common usage is that an array of code units that usually conforms to UTF-16 may be called "UTF-16" without the closure properties. I just disagree with that usage, because there are two camps that interpret "UTF-16" differently. One side says, "we have an array representation in UTF-16 that can handle all Unicode code points efficiently, and if you think you need more, think again", while the other says "it's too painful to have to check every result for valid UTF-16, and we need a UTF-16 type that supports the usual array operations on *characters* via the usual operators; if you think otherwise, think again." Note that despite the (presumed) resolution of the UTF-16 issue for CPython by PEP 393, at some point a very similar discussion will take place over "characters" anyway, because users and app developers are going to want a type that handles composition sequences and/or grapheme clusters for them, as well as comparison that respects canonical equivalence, even if it is inefficient compared to str. That's why I insisted on use of "array of code points" to describe the PEP 393 str type, rather than "array of characters". From arigo at tunes.org Wed Aug 31 08:43:45 2011 From: arigo at tunes.org (Armin Rigo) Date: Wed, 31 Aug 2011 08:43:45 +0200 Subject: [Python-Dev] Software Transactional Memory for Python In-Reply-To: <544C8633-8847-4018-875C-2FD093CCD885@gmail.com> References: <544C8633-8847-4018-875C-2FD093CCD885@gmail.com> Message-ID: Hi, On Tue, Aug 30, 2011 at 11:33 PM, Yury Selivanov wrote: > Maybe it'd be better to put 'atomic' in the threading module? 'threading' is pure Python. But anyway the consensus is to not have 'atomic' at all in the stdlib, which means it is in its own 3rd-party extension module. Armin From stefan_ml at behnel.de Wed Aug 31 09:26:49 2011 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 31 Aug 2011 09:26:49 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: stefan brunthaler, 30.08.2011 22:41: >> Ok, there there's something else you haven't told us. Are you saying >> that the original (old) bytecode is still used (and hence written to >> and read from .pyc files)? >> > Short answer: yes. > Long answer: I added an invocation counter to the code object and keep > interpreting in the usual Python interpreter until this counter > reaches a configurable threshold. When it reaches this threshold, I > create the new instruction format and interpret with this optimized > representation. All the macros look exactly the same in the source > code, they are just redefined to use the different instruction format. 
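(A minimal pure-Python sketch of the counter-based switch-over described above; the class name and threshold are invented for illustration, the real mechanism lives in C attached to the code object, and nothing derived is ever written to a .pyc file.)

    THRESHOLD = 1000                  # hypothetical hot-code threshold

    class ProfiledCode:
        def __init__(self, code):
            self.code = code          # the ordinary bytecode stays authoritative
            self.counter = 0
            self.optimized = None     # derived representation, built lazily in memory

        def invoke(self, interpret, derive, run_derived):
            if self.optimized is not None:
                return run_derived(self.optimized)
            self.counter += 1
            if self.counter < THRESHOLD:
                return interpret(self.code)
            self.optimized = derive(self.code)   # e.g. widen to word-sized instructions
            return run_derived(self.optimized)

Because the derived form exists only at run time, the on-disk .pyc format and its platform independence are untouched.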
> I am at no point serializing this representation or the runtime > information gathered by me, as any subsequent invocation might have > different characteristics. So, basically, you built a JIT compiler but don't want to call it that, right? Just because it compiles byte code to other byte code rather than to native CPU instructions does not mean it doesn't compile Just In Time. That actually sounds like a nice feature in general. It could even replace (or accompany?) the existing peep hole optimiser as part of a more general optimisation architecture, in the sense that it could apply byte code optimisations at runtime rather than compile time, potentially based on better knowledge about what's actually going on. > I will remove my development commentaries and create a private > repository at bitbucket I agree with the others that it's best to open up your repository for everyone who is interested. I can see no reason why you would want to close it back down once it's there. Stefan From v+python at g.nevcal.com Wed Aug 31 10:09:25 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 31 Aug 2011 01:09:25 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E5DEC35.4010404@g.nevcal.com> On 8/30/2011 11:03 PM, Stephen J. Turnbull wrote: > Guido van Rossum writes: > > On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull wrote: > > > > For starters, one that doesn't ever return lone surrogates, but rather > > > interprets surrogate pairs as Unicode code points as in UTF-16. (This > > > is not a Unicode standard definition, it's intended to be suggestive > > > of why many app writers will be distressed if they must use Python > > > unicode/str in a narrow build without a fairly comprehensive library > > > that wraps the arrays in operations that treat unicode/str as an array > > > of code points.) > > > > That sounds like a contradiction -- it wouldn't be a UTF-16 array if > > you couldn't tell that it was using UTF-16. > > Well, that's why I wrote "intended to be suggestive". The Unicode > Standard does not specify at all what the internal representation of > characters may be, it only specifies what their external behavior must > be when two processes communicate. (For "process" as used in the > standard, think "Python modules" here, since we are concerned with the > problems of folks who develop in Python.) When observing the behavior > of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or > even UTF-32 arrays; only arrays of characters. > > Thus, according to the rules of handling a UTF-16 stream, it is an > error to observe a lone surrogate or a surrogate pair that isn't a > high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and > C8-C10). That's what I mean by "can't tell it's UTF-16". And I > understand those requirements to mean that operations on UTF-16 > streams should produce UTF-16 streams, or raise an error. 
Without > that closure property for basic operations on str, I think it's a bad > idea to say that the representation of text in a str in a pre-PEP-393 > "narrow" build is UTF-16. For many users and app developers, it > creates expectations that are not fulfilled. > > It's true that common usage is that an array of code units that > usually conforms to UTF-16 may be called "UTF-16" without the closure > properties. I just disagree with that usage, because there are two > camps that interpret "UTF-16" differently. One side says, "we have an > array representation in UTF-16 that can handle all Unicode code points > efficiently, and if you think you need more, think again", while the > other says "it's too painful to have to check every result for valid > UTF-16, and we need a UTF-16 type that supports the usual array > operations on *characters* via the usual operators; if you think > otherwise, think again." > > Note that despite the (presumed) resolution of the UTF-16 issue for > CPython by PEP 393, at some point a very similar discussion will take > place over "characters" anyway, because users and app developers are > going to want a type that handles composition sequences and/or > grapheme clusters for them, as well as comparison that respects > canonical equivalence, even if it is inefficient compared to str. > That's why I insisted on use of "array of code points" to describe the > PEP 393 str type, rather than "array of characters". On topic: So from reading all this discussion, I think this point is rather a key one... and it has been made repeatedly in different ways: Arrays are not suitable for manipulating Unicode character sequences, and the str type is an array with a veneer of text manipulation operations, which do not, and cannot, by themselves, efficiently implement Unicode character sequences. Python wants to, should, and can implement UTF-16 streams, UTF-8 streams, and UTF-32 streams. It should, and can implement streams using other encodings as well, and also binary streams. Python wants to, should, and can implement 8-bit, 16-bit, 32-bit, and 64-bit arrays. These are efficient to access, index, and slice. Python implements a veneer on some 8-bit, 16-bit, and 32-bit arrays called str (this will be more true post-PEP 393, although it is true with caveats presently), which interpret array elements as code units (currently) or codepoints (post-PEP), and implements operations that are interesting for text processing, with caveats. There is presently no support for arrays of Unicode grapheme clusters or composed characters. The Python type called str may or may not be properly documented (to the extent that there is confusion between the actual contents of the elements of the type, and the concept of character as defined by Unicode). From comments Guido has made, he is not interested in changing the efficiency or access methods of the str type to raise the level of support of Unicode to the composed character, or grapheme cluster concepts. The str type itself can presently be used to process other character encodings: if they are fixed width < 32-bit elements those encodings might be considered Unicode encodings, but there is no requirement that they are, and some operations on str may operate with knowledge of some Unicode semantics, so there are caveats. 
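The gap between code units and characters is easy to demonstrate. On a narrow (16-bit) 3.2 build an interactive session gives roughly the following; a wide build reports length 1 for the first case, while the combining-sequence length is 2 on any build:

    >>> s = '\U0001D11E'        # MUSICAL SYMBOL G CLEF, a non-BMP code point
    >>> len(s)                  # counted in UTF-16 code units on a narrow build
    2
    >>> s[0]                    # indexing can expose a lone surrogate
    '\ud834'
    >>> len('e\u0301')          # a combining sequence is two code points everywhere
    2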
So it seems that any semantics in support of composed characters, grapheme clusters, or codepoints-stored-as-<32-bit-code-units, must be created as either an add-on Python package (in Python) or C extension, or a combination. It could be based on extensions to the existing str type, or it could be based on the array type, or it could based on the bytes type. It could use an internal format of 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or 16-bit codeunits. In addition to the expected stream operations, character length, indexing, and slicing operations, additional more complex operations would be expected on Unicode string values: regular expressions, comparisons, collations, case-shifting, and perhaps more. RTL and LTR awareness would add complexity to all operations, or at least variants of all operations. The questions are: 1) Is anyone interested in writing a PEP for such a thing? 2) Is anyone interested in writing an implementation for such a thing? 3) How many conflicting opinions and arguments will be spawned, making the poor person or persons above lose interest? Brainstorming ideas (which may wander off-topic in some regards, but were all inspired by this discussion): BI-0: Tom's analysis makes me think that the UTF-8 encoding, since it is smallest on the average language, and an implementation based on a foundation type of bytes or 'B' arrays, plus some side indexes of some sort, could be an appropriate starting point. UTF-8 is variable length, but so are composed characters and grapheme clusters. Building an array, each of whose units could hold the largest grapheme cluster would seem extremely inefficient, just like 32-bit Unicode is extremely inefficient for dealing with ASCII, so variable length units seem to be an imperative part of a solution. At least until one thinks up BI-2. BI-1: Perhaps a 32-bit base, with the upper 11 bits used to cache character characteristics from various character attribute database lookups could be an effective alternative, but wouldn't eliminate the need for dealing with variable length units for length, indexing, and slicing operations. BI-2: Maybe a 32-bit base would be useful so that one high bit could be used to flag that this character position actually holds an index to a multi-codepoint character, and the index would then hold the actual codes for that character. This would allow for at most 2^31 (and memory limited) different multi-codepoint characters in a string (or perhaps per application, if the multi-codepoint characters are shared between strings), but would suddenly allow array indexing of grapheme clusters and composed characters... with double-indexing required for multi-codepoint character access. [This idea seems similar to one that was mentioned elsewhere in this thread, suggesting that private use characters could be used to represent multi-codepoint characters, but (a) doesn't infringe on private uses, and (b) allows for a greater number of multi-codepoint characters to be used.] BI-3: both BI-1 and BI-2 would also allow themselves to be built on top of PEP 393 str... allowing multi-codepoint-character-supporting applications to benefit from the space efficiencies of PEP 393 when no multi-codepoint characters are fed into the application. BI-4: Since Unicode has 21-bit codepoints, one wonders if 24-bit array elements might be appropriate, rather than 32-bit. BI-2 could still operate, with a theoretical reduction to 2^23 possible multi-codepoint characters in an application. 
Access would be less efficient, but still O(1), and 25% of the space would be saved. This idea could be applied to PEP 393 independently of multi-codepoint character support. BI-5: I'm pretty sure there are inappropriate or illegal sequences of combining characters that should not stand alone. One example of this is lone surrogates. Such characters out of an appropriate sequence could be flagged with a high-bit so that they could be quickly recognized as illegal Unicode, but codecs could be provided to allow them to round-trip, and applications could recognize immediately that they should be handled as "binary gibberish" in an otherwise Unicode stream. This idea could be applied to PEP 393 independently of additional multi-codepoint character support. BI-6: Maybe another high bit could be used with a different codec error handler instead of using lone surrogates when decoding not-quite-conformant byte streams (such as OS filenames). Sad we didn't think of this one before doing all the lone surrogate stuff. Of course, this solution wouldn't work on narrow builds, because not even surrogates can represent high bits above Unicode codepoints! But once we have PEP 393, we _could_ replace inappropriate use of lone surrogates, with use of out-of-the-Unicode-codepoint range integers, without introducing ambiguity in the interpretation of lone surrogates. This idea could be applied to PEP 393 independently of multi-codepoint character support. Glenn -------------- next part -------------- An HTML attachment was scrubbed... URL: From stephen at xemacs.org Wed Aug 31 14:21:38 2011 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 31 Aug 2011 21:21:38 +0900 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E5DEC35.4010404@g.nevcal.com> References: <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> <4E5DEC35.4010404@g.nevcal.com> Message-ID: <87vctdkbzh.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > From comments Guido has made, he is not interested in changing the > efficiency or access methods of the str type to raise the level of > support of Unicode to the composed character, or grapheme cluster > concepts. IMO, that would be a bad idea, as higher-level Unicode support should either be a wrapper around full implementations such as ICU (or platform support in .NET or Java), or written in pure Python at first. Thus there is a need for an efficient array of code units type. PEP 393 allows this to go to the level of code points, but evidently that is inappropriate for Jython and IronPython. > The str type itself can presently be used to process other > character encodings: Not really. Remember, on input codecs always decode to Unicode and on output they always encode from Unicode. How do you propose to get other encodings into the array of code units? > [A "true Unicode" type] could be based on extensions to the > existing str type, or it could be based on the array type, or it > could based on the bytes type. It could use an internal format of > 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or > 16-bit codeunits. 
In theory yes, but in practice all of the string methods and libraries like re operate on str (and often but not always bytes; in particular, codecs always decode from byte and encode to bytes). Why bother with anything except arrays of code points at the start? PEP 393 makes that time-efficient and reasonably space-efficient as a starting point and allows starting with re or MRAB's regex to get basic RE functionality or good UTS #18 functionality respectively. Plus str already has all the usual string operations (.startswith(), .join(), etc), and we have modules for dealing with the Unicode Character Database. Why waste effort reintegrating with all that, until we have common use cases that need more efficient representation? There would be some issue in coming up with an appropriate UTF-16 to code point API for Jython and IronPython, but Terry Reedy has a rather efficient library for that already. So this discussion of alternative representations, including use of high bits to represent properties, is premature optimization ... especially since we don't even have a proto-PEP specifying how much conformance we want of this new "true Unicode" type in the first place. We need to focus on that before optimizing anything. From stefan at brunthaler.net Wed Aug 31 18:54:41 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Wed, 31 Aug 2011 09:54:41 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: > I think that you must deal with big endianess because some RISC can't handle > at all data in little endian format. > > In WPython I have wrote some macros which handle both endianess, but lacking > big endian machines I never had the opportunity to verify if something was > wrong. > I am sorry for the temporal lapse of not getting back to this directly yesterday, we were just heading out for lunch and I figured it only out then but immediately forgot it on our way back to the lab... So, as I have already said, I evaluated my optimizations on x86 (little-endian) and PowerPC 970 (big-endian) and I did not have to change any of my instruction decoding during interpretation. (The only nasty bug I still remember vividly was that while on gcc for x86 the data type char defaults to signed, whereas it defaults to unsigned on PowerPC's gcc.) When I have time and access to a PowerPC machine again (an ARM might be interesting, too), I will take a look at the generated assembly code to figure out why this is working. (I have some ideas why it might work without changing the code.) If I run into any problems, I'll gladly contact you :) BTW: AFAIR, we emailed last year regarding wpython and IIRC your optimizations could primarily be summarized as clever superinstructions. I have not implemented anything in that area at all (and have in fact not even touched the compiler and its peephole optimizer), but if parts my implementation gets in, I am sure that you could add some of your work on top of that, too. Cheers, --stefan From stefan at brunthaler.net Wed Aug 31 19:08:12 2011 From: stefan at brunthaler.net (stefan brunthaler) Date: Wed, 31 Aug 2011 10:08:12 -0700 Subject: [Python-Dev] Python 3 optimizations continued... 
In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: > So, basically, you built a JIT compiler but don't want to call it that, > right? Just because it compiles byte code to other byte code rather than to > native CPU instructions does not mean it doesn't compile Just In Time. > For me, a definition of a JIT compiler or any dynamic compilation subsystem entails that native machine code is generated at run-time. Furthermore, I am not compiling from bytecode to bytecode, but rather changing the instruction encoding underneath and subsequently using quickening to optimize interpretation. But, OTOH, I am not aware of a canonical definition of JIT compilation, so it depends ;) > I agree with the others that it's best to open up your repository for > everyone who is interested. I can see no reason why you would want to close > it back down once it's there. > Well, my code has primarily been a vehicle for my research in that area and thus is not immediately suited to adoption (it does not adhere to Python C coding standards, contains lots of private comments about various facts, debugging hints, etc.). The explanation for this is easy: When I started out on my research it was far from clear that it would be successful and really that much faster. So, I would like to clean up the comments and some parts of the code and publish the code I have without any of the clean-up work for naming conventions, etc., so that you can all take a look and it is clear what it's all about. After that we can then have a factual discussion about whether it fits the bill for you, too, and if so, which changes (naming conventions, extensive documentation, etc.) are necessary *before* any adoption is reasonable for you, too. That seems to be a good way to start off and get results and feedback quickly; any ideas/complaints/comments/suggestions? Best regards, --stefan PS: I am using Nick's suggested plan to incorporate my changes directly into the most recent version, as mine is currently only running on Python 3.1. From guido at python.org Wed Aug 31 19:10:16 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 31 Aug 2011 10:10:16 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4E537EEC.1070602@v.loewis.de> <1314099542.3485.10.camel@localhost.localdomain> <4E53945E.1050102@v.loewis.de> <1314101745.3485.18.camel@localhost.localdomain> <4E53A5D1.2040808@v.loewis.de> <4E53A950.30005@haypocalc.com> <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull wrote: [me] > > That sounds like a contradiction -- it wouldn't be a UTF-16 array if > > you couldn't tell that it was using UTF-16. > > Well, that's why I wrote "intended to be suggestive". The Unicode > Standard does not specify at all what the internal representation of > characters may be, it only specifies what their external behavior must > be when two processes communicate.
(For "process" as used in the > standard, think "Python modules" here, since we are concerned with the > problems of folks who develop in Python.) When observing the behavior > of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or > even UTF-32 arrays; only arrays of characters. Hm, that's not how I would read "process". IMO that is an intentionally vague term, and we are free to decide how to interpret it. I don't think it will work very well to define a process as a Python module; what about Python modules that agree about passing along arrays of code units (or streams of UTF-8, for that matter)? This is why I find the issue of Python, the language (and stdlib), as a whole "conforming to the Unicode standard" such a troublesome concept -- I think it is something that an application may claim, but the language should make much more modest claims, such as "the regular expression syntax supports features X, Y and Z from the Unicode recommendation XXX", or "the UTF-8 codec will never emit a sequence of bytes that is invalid according to Unicode specification YYY". (As long as the Unicode references are also versioned or dated.) I'm fine with saying "it is hard to write Unicode-conforming application code for reason ZZZ" and proposing a fix (e.g. PEP 393 fixes a specific complaint about code units being inferior to code points for most types of processing). I'm not fine with saying "the string datatype should conform to the Unicode standard". > Thus, according to the rules of handling a UTF-16 stream, it is an > error to observe a lone surrogate or a surrogate pair that isn't a > high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and > C8-C10). That's what I mean by "can't tell it's UTF-16". But if you can observe (valid) surrogate pairs it is still UTF-16. > And I > understand those requirements to mean that operations on UTF-16 > streams should produce UTF-16 streams, or raise an error. Without > that closure property for basic operations on str, I think it's a bad > idea to say that the representation of text in a str in a pre-PEP-393 > "narrow" build is UTF-16. For many users and app developers, it > creates expectations that are not fulfilled. Ok, I dig this, to some extent. However saying it is UCS-2 is equally bad. I guess this is why Java and .NET just say their string types contain arrays of "16-bit characters", with essentially no semantics attached to the word "character" besides "16-bit unsigned integer". At the same time I think it would be useful if certain string operations like .lower() worked in such a way that *if* the input were valid UTF-16, *then* the output would also be, while *if* the input contained an invalid surrogate, the result would simply be something that is no worse (in particular, those are all mapped to themselves). We could even go further and have .lower() and friends look at graphemes (multi-code-point characters) if the Unicode std has a useful definition of e.g. lowercasing graphemes that differed from lowercasing code points. An analogy is actually found in .lower() on 8-bit strings in Python 2: it assumes the string contains ASCII, and non-ASCII characters are mapped to themselves. If your string contains Latin-1 or EBCDIC or UTF-8 it will not do the right thing. But that doesn't mean strings cannot contain those encodings, it just means that the .lower() method is not useful if they do. (Why ASCII? Because that is the system encoding in Python 2.)
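Concretely, a small sketch of how this already plays out (the lone-surrogate string is of course not valid text; it is only there to show the garbage-in-garbage-out behavior):

    # Python 3: code points with no lowercase mapping -- including lone
    # surrogates -- are simply mapped to themselves by .lower().
    s = "A\ud800B"                 # contains a lone surrogate, so not valid UTF-16/UTF-8 text
    assert s.lower() == "a\ud800b"
    # The Python 2 byte-string analogy: only ASCII letters are touched.
    # >>> 'AB\xc9'.lower()
    # 'ab\xc9'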
> It's true that common usage is that an array of code units that > usually conforms to UTF-16 may be called "UTF-16" without the closure > properties. ?I just disagree with that usage, because there are two > camps that interpret "UTF-16" differently. ?One side says, "we have an > array representation in UTF-16 that can handle all Unicode code points > efficiently, and if you think you need more, think again", while the > other says "it's too painful to have to check every result for valid > UTF-16, and we need a UTF-16 type that supports the usual array > operations on *characters* via the usual operators; if you think > otherwise, think again." I think we should just document how it behaves and not get hung up on what it is called. Mentioning UTF-16 is still useful because it indicates that some operations may act properly on surrogate pairs. (Also because of course character properties for BMP characters are respected, etc.) > Note that despite the (presumed) resolution of the UTF-16 issue for > CPython by PEP 393, at some point a very similar discussion will take > place over "characters" anyway, because users and app developers are > going to want a type that handles composition sequences and/or > grapheme clusters for them, as well as comparison that respects > canonical equivalence, even if it is inefficient compared to str. > That's why I insisted on use of "array of code points" to describe the > PEP 393 str type, rather than "array of characters". Let's call those things graphemes (Tom C's term, I quite like leaving "character" ambiguous) -- they are sequences of multiple code points that represent a single "visual squiggle" (the kind of thing that you'd want to be swappable in vim with "xp" :-). I agree that APIs are needed to manipulate (match, generate, validate, mutilate, etc.) things at the grapheme level. I don't agree that this means a separate data type is required. There are ever-larger units of information encoded in text strings, with ever farther-reaching (and more vague) requirements on valid sequences. Do you want to have a data type that can represent (only valid) words in a language? Sentences? Novels? I think that at this point in time the best we can do is claim that Python (the language standard) uses either 16-bit code units or 21-bit code points in its string datatype, and that, thanks to PEP 393, CPython 3.3 and further will always use 21-bit code points (but Jython and IronPython may forever use their platform's native 16-bit code unit representing string type). And then we add APIs that can be used everywhere to look for code points (even if the string contains code points), graphemes, or larger constructs. I'd like those APIs to be designed using a garbage-in-garbage-out principle, where if the input conforms to some Unicode requirement, the output does too, but if the input doesn't, the output does what makes most sense. Validation is then limited to codecs, and optional calls. If you index or slice a string, or create a string from chr() of a surrogate or from some other value that the Unicode standard considers an illegal code point, you better know what you are doing. I want chr(i) to be valid for all values of i in range(2**21), so it can be used to create a lone surrogate, or (on systems with 16-bit "characters") a surrogate pair. And also ord(chr(i)) == i for all i in range(2**21). 
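The lone-surrogate half of that already holds today; the full range(2**21) does not, since chr() presently stops at 0x10FFFF, which is part of what is being discussed here:

    i = 0xD800                     # a lone high surrogate
    s = chr(i)                     # accepted: no validity check is applied
    assert len(s) == 1
    assert ord(s) == i             # round-trips, even though s is not valid Unicode text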
I'm not sure about ord() on a 2-character string containing a surrogate pair on systems where strings contain 21-bit code points; I think it should be an error there, just as ord() on other strings of length != 1. But on systems with 16-bit "characters", ord() of strings of length 2 containing a valid surrogate pair should work. -- --Guido van Rossum (python.org/~guido) From guido at python.org Wed Aug 31 19:12:44 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 31 Aug 2011 10:12:44 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E5DEC35.4010404@g.nevcal.com> References: <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> <4E5DEC35.4010404@g.nevcal.com> Message-ID: On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman wrote: > So from reading all this discussion, I think this point is rather a key > one... and it has been made repeatedly in different ways:? Arrays are not > suitable for manipulating Unicode character sequences, and the str type is > an array with a veneer of text manipulation operations, which do not, and > cannot, by themselves, efficiently implement Unicode character sequences. I think this is too strong. The str type is indeed an array, and you can build useful Unicode manipulation APIs on top of it. Just like bytes are not UTF-8, but can be used to represent UTF-8 and a fully-compliant UTF-8 codec can be implemented on top of it. -- --Guido van Rossum (python.org/~guido) From guido at python.org Wed Aug 31 19:20:19 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 31 Aug 2011 10:20:19 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E5DEC35.4010404@g.nevcal.com> References: <87r54bb4mq.fsf@uwakimon.sk.tsukuba.ac.jp> <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> <4E5DEC35.4010404@g.nevcal.com> Message-ID: On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman wrote: > The str type itself can presently be used to process other > character encodings: if they are fixed width < 32-bit elements those > encodings might be considered Unicode encodings, but there is no requirement > that they are, and some operations on str may operate with knowledge of some > Unicode semantics, so there are caveats. Actually, the str type in Python 3 and the unicode type in Python 2 are constrained everywhere to either 16-bit or 21-bit "characters". (Except when writing C code, which can do any number of invalid things so is the equivalent of assuming 1 == 0.) In particular, on a wide build, there is no way to get a code point >= 2**21, and I don't want PEP 393 to change this. So at best we can use these types to repesent arrays of 21-bit unsigned ints. 
But I think it is more useful to think of them as always representing "some form of Unicode", whether that is UTF-16 (on narrow builds) or 21-bit code points or perhaps some vaguely similar superset -- but for those code units/code points that are representable *and* valid (either code points or code units) according to the (supported version of) the Unicode standard, the meaning of those code points/units matches that of the standard. Note that this is different from the bytes type, where the meaning of a byte is entirely determined by what it means in the programmer's head. -- --Guido van Rossum (python.org/~guido) From guido at python.org Wed Aug 31 19:28:57 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 31 Aug 2011 10:28:57 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> Message-ID: On Tue, Aug 30, 2011 at 10:04 PM, Cesare Di Mauro wrote: > It isn't, because motivation to do something new with CPython vanishes, at > least on some areas (virtual machine / ceval.c), even having some ideas to > experiment with. That's why in my last talk on EuroPython I decided to move > on other areas (Python objects). Cesare, I'm really sorry that you became so disillusioned that you abandoned wordcode. I agree that we were too optimistic about Unladen Swallow. Also that the existence of PyPy and its PR machine (:-) should not stop us from improving CPython. I'm wondering if, with your experience in creating WPython, you could review Stefan Brunthaler's code and approach (once he's put it up for review) and possibly the two of you could even work on a joint project? -- --Guido van Rossum (python.org/~guido) From guido at python.org Wed Aug 31 19:31:13 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 31 Aug 2011 10:31:13 -0700 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: On Wed, Aug 31, 2011 at 10:08 AM, stefan brunthaler wrote: > Well, my code has primarily been a vehicle for my research in that > area and thus is not immediately suited to adoption [...]. But if you want to be taken seriously as a researcher, you should publish your code! Without publication of your *code* research in your area cannot be reproduced by others, so it is not science. Please stop being shy and open up what you have. The software engineering issues can be dealt with separately! -- --Guido van Rossum (python.org/~guido) From v+python at g.nevcal.com Wed Aug 31 20:51:28 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 31 Aug 2011 11:51:28 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> <4E5DEC35.4010404@g.nevcal.com> Message-ID: <4E5E82B0.4020302@g.nevcal.com> On 8/31/2011 10:12 AM, Guido van Rossum wrote: > On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman wrote: >> So from reading all this discussion, I think this point is rather a key >> one... 
and it has been made repeatedly in different ways: Arrays are not >> suitable for manipulating Unicode character sequences, and the str type is >> an array with a veneer of text manipulation operations, which do not, and >> cannot, by themselves, efficiently implement Unicode character sequences. > I think this is too strong. The str type is indeed an array, and you > can build useful Unicode manipulation APIs on top of it. Just like > bytes are not UTF-8, but can be used to represent UTF-8 and a > fully-compliant UTF-8 codec can be implemented on top of it. > This statement is a logical conclusion of arguments presented in this thread. 1) Applications that wish to do grapheme access, wish to do it by grapheme array indexing, because that is the efficient way to do it. 2) As long as str is restricted to holding Unicode code units or code points, then it cannot support grapheme array indexing efficiently. I have not declared that useful Unicode manipulations APIs cannot be built on top of str, only that efficiency will suffer. -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Wed Aug 31 20:56:03 2011 From: guido at python.org (Guido van Rossum) Date: Wed, 31 Aug 2011 11:56:03 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <4E5E82B0.4020302@g.nevcal.com> References: <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> <4E5DEC35.4010404@g.nevcal.com> <4E5E82B0.4020302@g.nevcal.com> Message-ID: On Wed, Aug 31, 2011 at 11:51 AM, Glenn Linderman wrote: > On 8/31/2011 10:12 AM, Guido van Rossum wrote: > > On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman wrote: > > So from reading all this discussion, I think this point is rather a key > one... and it has been made repeatedly in different ways: Arrays are not > suitable for manipulating Unicode character sequences, and the str type is > an array with a veneer of text manipulation operations, which do not, and > cannot, by themselves, efficiently implement Unicode character sequences. > > I think this is too strong. The str type is indeed an array, and you > can build useful Unicode manipulation APIs on top of it. Just like > bytes are not UTF-8, but can be used to represent UTF-8 and a > fully-compliant UTF-8 codec can be implemented on top of it. > > > > This statement is a logical conclusion of arguments presented in this > thread. > > 1) Applications that wish to do grapheme access, wish to do it by grapheme > array indexing, because that is the efficient way to do it. > I don't believe that should be taken as gospel. In Perl, they don't do array indexing on strings at all, and use regex matching instead. An API that uses some kind of cursor on a string might work fine in Python too (for grapheme matching). 2) As long as str is restricted to holding Unicode code units or code > points, then it cannot support grapheme array indexing efficiently. > > I have not declared that useful Unicode manipulations APIs cannot be built > on top of str, only that efficiency will suffer. > But you have not proven it. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From v+python at g.nevcal.com Wed Aug 31 21:14:25 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 31 Aug 2011 12:14:25 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> <4E5DEC35.4010404@g.nevcal.com> <4E5E82B0.4020302@g.nevcal.com> Message-ID: <4E5E8811.90600@g.nevcal.com> On 8/31/2011 11:56 AM, Guido van Rossum wrote: > On Wed, Aug 31, 2011 at 11:51 AM, Glenn Linderman > > wrote: > > On 8/31/2011 10:12 AM, Guido van Rossum wrote: >> On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman wrote: >>> So from reading all this discussion, I think this point is rather a key >>> one... and it has been made repeatedly in different ways: Arrays are not >>> suitable for manipulating Unicode character sequences, and the str type is >>> an array with a veneer of text manipulation operations, which do not, and >>> cannot, by themselves, efficiently implement Unicode character sequences. >> I think this is too strong. The str type is indeed an array, and you >> can build useful Unicode manipulation APIs on top of it. Just like >> bytes are not UTF-8, but can be used to represent UTF-8 and a >> fully-compliant UTF-8 codec can be implemented on top of it. >> > > This statement is a logical conclusion of arguments presented in > this thread. > > 1) Applications that wish to do grapheme access, wish to do it by > grapheme array indexing, because that is the efficient way to do it. > > > I don't believe that should be taken as gospel. In Perl, they don't do > array indexing on strings at all, and use regex matching instead. An > API that uses some kind of cursor on a string might work fine in > Python too (for grapheme matching). The last benchmark I saw, regexp in Perl is faster than regexp in Python; that was some years back, before regexp in Perl supported quite as much Unicode as it does now; not sure if someone has done a recent performance benchmarks; Tom's survey indicates that the functionality presently differs, so it is not clear if performance benchmarks are presently appropriate to attempt to measure Unicode operations in regexp between the two languages. That said, regexp, or some sort of cursor on a string, might be a workable solution. Will it have adequate performance? Perhaps, at least for some applications. Will it be as conceptually simple as indexing an array of graphemes? No. Will it ever reach the efficiency of indexing an array of graphemes? No. Does that matter? Depends on the application. > > 2) As long as str is restricted to holding Unicode code units or > code points, then it cannot support grapheme array indexing > efficiently. > > I have not declared that useful Unicode manipulations APIs cannot > be built on top of str, only that efficiency will suffer. > > > But you have not proven it. Do you disagree that indexing an array is more efficient than manipulating strings with regex or binary trees? I think not, because you are insistent that array indexing of str be preserved as O(1). I agree that I have not proven it; it largely depends on whether or not indexing by grapheme cluster is a useful operation in applications. 
Yet Stephen (I think) has commented that emacs performance goes down as soon as multi-byte characters are introduced into an edit buffer. So I think he has proven that efficiency can suffer, in some implementations/applications. Terry's O(k) implementation requires data beyond strings, and isn't O(1). -------------- next part -------------- An HTML attachment was scrubbed... URL: From v+python at g.nevcal.com Wed Aug 31 21:14:52 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 31 Aug 2011 12:14:52 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> <4E5DEC35.4010404@g.nevcal.com> Message-ID: <4E5E882C.1050006@g.nevcal.com> On 8/31/2011 10:20 AM, Guido van Rossum wrote: > On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman wrote: >> The str type itself can presently be used to process other >> character encodings: if they are fixed width< 32-bit elements those >> encodings might be considered Unicode encodings, but there is no requirement >> that they are, and some operations on str may operate with knowledge of some >> Unicode semantics, so there are caveats. > Actually, the str type in Python 3 and the unicode type in Python 2 > are constrained everywhere to either 16-bit or 21-bit "characters". > (Except when writing C code, which can do any number of invalid things > so is the equivalent of assuming 1 == 0.) In particular, on a wide > build, there is no way to get a code point>= 2**21, and I don't want > PEP 393 to change this. So at best we can use these types to repesent > arrays of 21-bit unsigned ints. But I think it is more useful to think > of them as always representing "some form of Unicode", whether that is > UTF-16 (on narrow builds) or 21-bit code points or perhaps some > vaguely similar superset -- but for those code units/code points that > are representable *and* valid (either code points or code units) > according to the (supported version of) the Unicode standard, the > meaning of those code points/units matches that of the standard. > > Note that this is different from the bytes type, where the meaning of > a byte is entirely determined by what it means in the programmer's > head. > Sorry, my Perl background is leaking through. I didn't double check that str constrains the values of each element to range 0x110000 but I see now by testing that it does. For some of my ideas, then, either a subtype of str would have to be able to relax that constraint, or str would not be the appropriate base type to use (but there are other base types that could be used, so this is not a serious issue for the ideas). I have no problem with thinking of str as representing "some form of Unicode". None of my proposals change that, although they may change other things, and may invent new forms of Unicode representations. You have stated that it is better to document what str actually does, rather than attempt to adhere slavishly to Unicode standard concepts. The Unicode Consortium may well define legal, conforming bytestreams for communicating processes, but languages and applications are free to use other representations internally. 
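For the record, the quick test mentioned above amounts to no more than something like this:

    chr(0x10FFFF)      # accepted: the largest Unicode code point
    chr(0xD800)        # also accepted: surrogates are not rejected here
    chr(0x110000)      # raises ValueError -- elements are confined to range(0x110000)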
We can either artificially constrain ourselves to minor tweaks of the legal conforming bytestreams, or we can invent a representation (whether called str or something else) that is useful and efficient in practice. -------------- next part -------------- An HTML attachment was scrubbed... URL: From v+python at g.nevcal.com Wed Aug 31 21:15:12 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 31 Aug 2011 12:15:12 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: <87vctdkbzh.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> <4E5DEC35.4010404@g.nevcal.com> <87vctdkbzh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E5E8840.4080600@g.nevcal.com> On 8/31/2011 5:21 AM, Stephen J. Turnbull wrote: > Glenn Linderman writes: > > > From comments Guido has made, he is not interested in changing the > > efficiency or access methods of the str type to raise the level of > > support of Unicode to the composed character, or grapheme cluster > > concepts. > > IMO, that would be a bad idea, OK you agree with Guido. > as higher-level Unicode support should > either be a wrapper around full implementations such as ICU (or > platform support in .NET or Java), or written in pure Python at first. > Thus there is a need for an efficient array of code units type. PEP > 393 allows this to go to the level of code points, but evidently that > is inappropriate for Jython and IronPython. > > > The str type itself can presently be used to process other > > character encodings: > > Not really. Remember, on input codecs always decode to Unicode and on > output they always encode from Unicode. How do you propose to get > other encodings into the array of code units? Here are two ways, there may be more: custom codecs, direct assignment > > [A "true Unicode" type] could be based on extensions to the > > existing str type, or it could be based on the array type, or it > > could based on the bytes type. It could use an internal format of > > 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or > > 16-bit codeunits. > > In theory yes, but in practice all of the string methods and libraries > like re operate on str (and often but not always bytes; in particular, > codecs always decode from byte and encode to bytes). > > Why bother with anything except arrays of code points at the start? > PEP 393 makes that time-efficient and reasonably space-efficient as a > starting point and allows starting with re or MRAB's regex to get > basic RE functionality or good UTS #18 functionality respectively. > Plus str already has all the usual string operations (.startswith(), > .join(), etc), and we have modules for dealing with the Unicode > Character Database. Why waste effort reintegrating with all that, > until we have common use cases that need more efficient representation? String methods could be reimplemented on any appropriate type, of course. Rejecting alternatives too soon might make one miss the best design. > There would be some issue in coming up with an appropriate UTF-16 to > code point API for Jython and IronPython, but Terry Reedy has a rather > efficient library for that already. 
Yes, Terry's implementation is interesting, and inspiring, and that concept could be extended to a variety of interesting techniques: codepoint access of code unit representations, and multi-codepoint character access on top of either code unit or codepoint representations. > So this discussion of alternative representations, including use of > high bits to represent properties, is premature optimization > ... especially since we don't even have a proto-PEP specifying how > much conformance we want of this new "true Unicode" type in the first > place. > > We need to focus on that before optimizing anything. You may call it premature optimization if you like, or you can ignore the concepts and emails altogether. I call it brainstorming for ideas, looking for non-obvious solutions to the problem of representation of Unicode. I found your discussion of streams versus arrays, as separate concepts related to Unicode, along with Terry's bisect indexing implementation, to rather inspiring. Just because Unicode defines streams of codeunits of various sizes (UTF-8, UTF-16, UTF-32) to represent characters when processes communicate and for storage (which is one way processes communicate), that doesn't imply that the internal representation of character strings in a programming language must use exactly that representation. While there are efficiencies in using the same representation as is used by the communications streams, there are also inefficiencies. I'm unaware of any current Python implementation that has chosen to use UTF-8 as the internal representation of character strings (I'm also aware Perl has made that choice), yet UTF-8 is one of the commonly recommend character representations on the Linux platform, from what I read. So in that sense, Python has rejected the idea of using the "native" or "OS configured" representation as its internal representation. So why, then, must one choose from a repertoire of Unicode-defined stream representations if they don't meet the goal of efficient length, indexing, or slicing operations on actual characters? -------------- next part -------------- An HTML attachment was scrubbed... URL: From v+python at g.nevcal.com Wed Aug 31 22:04:01 2011 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 31 Aug 2011 13:04:01 -0700 Subject: [Python-Dev] PEP 393 Summer of Code Project In-Reply-To: References: <4E576793.2010203@v.loewis.de> <4E5824E1.9010101@udel.edu> <4E5869C2.2040008@udel.edu> <8420B962-0F4B-45D3-9B1A-0C5C3AD3676E@gmail.com> <87ippglw6b.fsf@uwakimon.sk.tsukuba.ac.jp> <20110829141440.2e2178c6@pitrou.net> <874o0ylsq6.fsf@uwakimon.sk.tsukuba.ac.jp> <1314724786.3554.1.camel@localhost.localdomain> <8739gil27m.fsf@uwakimon.sk.tsukuba.ac.jp> <87y5yajewn.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4E5E93B1.8070301@g.nevcal.com> On 8/31/2011 10:10 AM, Guido van Rossum wrote: > On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull > wrote: > [me] >> > That sounds like a contradiction -- it wouldn't be a UTF-16 array if >> > you couldn't tell that it was using UTF-16. >> >> Well, that's why I wrote "intended to be suggestive". The Unicode >> Standard does not specify at all what the internal representation of >> characters may be, it only specifies what their external behavior must >> be when two processes communicate. (For "process" as used in the >> standard, think "Python modules" here, since we are concerned with the >> problems of folks who develop in Python.) 
When observing the behavior >> of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or >> even UTF-32 arrays; only arrays of characters. > Hm, that's not how I would read "process". IMO that is an > intentionally vague term, and we are free to decide how to interpret > it. I don't think it will work very well to define a process as a > Python module; what about Python modules that agree about passing > along array of code units (or streams of UTF-8, for that matter)? > > This is why I find the issue of Python, the language (and stdlib), as > a whole "conforming to the Unicode standard" such a troublesome > concept -- I think it is something that an application may claim, but > the language should make much more modest claims, such as "the regular > expression syntax supports features X, Y and Z from the Unicode > recommendation XXX, or "the UTF-8 codec will never emit a sequence of > bytes that is invalid according Unicode specification YYY". (As long > as the Unicode references are also versioned or dated.) > > I'm fine with saying "it is hard to write Unicode-conforming > application code for reason ZZZ" and proposing a fix (e.g. PEP 393 > fixes a specific complaint about code units being inferior to code > points for most types of processing). I'm not fine with saying "the > string datatype should conform to the Unicode standard". > >> Thus, according to the rules of handling a UTF-16 stream, it is an >> error to observe a lone surrogate or a surrogate pair that isn't a >> high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and >> C8-C10). That's what I mean by "can't tell it's UTF-16". > But if you can observe (valid) surrogate pairs it is still UTF-16. > >> And I >> understand those requirements to mean that operations on UTF-16 >> streams should produce UTF-16 streams, or raise an error. Without >> that closure property for basic operations on str, I think it's a bad >> idea to say that the representation of text in a str in a pre-PEP-393 >> "narrow" build is UTF-16. For many users and app developers, it >> creates expectations that are not fulfilled. > Ok, I dig this, to some extent. However saying it is UCS-2 is equally > bad. I guess this is why Java and .NET just say their string types > contain arrays of "16-bit characters", with essentially no semantics > attached to the word "character" besides "16-bit unsigned integer". > > At the same time I think it would be useful if certain string > operations like .lower() worked in such a way that *if* the input were > valid UTF-16, *then* the output would also be, while *if* the input > contained an invalid surrogate, the result would simply be something > that is no worse (in particular, those are all mapped to themselves). > We could even go further and have .lower() and friends look at > graphemes (multi-code-point characters) if the Unicode std has a > useful definition of e.g. lowercasing graphemes that differed from > lowercasing code points. > > An analogy is actually found in .lower() on 8-bit strings in Python 2: > it assumes the string contains ASCII, and non-ASCII characters are > mapped to themselves. If your string contains Latin-1 or EBCDIC or > UTF-8 it will not do the right thing. But that doesn't mean strings > cannot contain those encodings, it just means that the .lower() method > is not useful if they do. (Why ASCII? Because that is the system > encoding in Python 2.) 
So if Python 3.3+ uses Unicode codepoints as its str representation, the analogy to ASCII and Python 2 would imply that it should permit out-of-range codepoints, if they can be represented in the underlying data values. Valid codecs would not create such on input, and Valid codecs would not accept such on output. Operations on codepoints should, like .lower(), use the identity operation when applied to non-codepoints. > >> It's true that common usage is that an array of code units that >> usually conforms to UTF-16 may be called "UTF-16" without the closure >> properties. I just disagree with that usage, because there are two >> camps that interpret "UTF-16" differently. One side says, "we have an >> array representation in UTF-16 that can handle all Unicode code points >> efficiently, and if you think you need more, think again", while the >> other says "it's too painful to have to check every result for valid >> UTF-16, and we need a UTF-16 type that supports the usual array >> operations on *characters* via the usual operators; if you think >> otherwise, think again." > I think we should just document how it behaves and not get hung up on > what it is called. Mentioning UTF-16 is still useful because it > indicates that some operations may act properly on surrogate pairs. > (Also because of course character properties for BMP characters are > respected, etc.) > >> Note that despite the (presumed) resolution of the UTF-16 issue for >> CPython by PEP 393, at some point a very similar discussion will take >> place over "characters" anyway, because users and app developers are >> going to want a type that handles composition sequences and/or >> grapheme clusters for them, as well as comparison that respects >> canonical equivalence, even if it is inefficient compared to str. >> That's why I insisted on use of "array of code points" to describe the >> PEP 393 str type, rather than "array of characters". > Let's call those things graphemes (Tom C's term, I quite like leaving > "character" ambiguous) -- they are sequences of multiple code points > that represent a single "visual squiggle" (the kind of thing that > you'd want to be swappable in vim with "xp" :-). I agree that APIs are > needed to manipulate (match, generate, validate, mutilate, etc.) > things at the grapheme level. I don't agree that this means a separate > data type is required. There are ever-larger units of information > encoded in text strings, with ever farther-reaching (and more vague) > requirements on valid sequences. Do you want to have a data type that > can represent (only valid) words in a language? Sentences? Novels? Interesting ideas. Once you break the idea that every code point must be directly indexed, and that higher level concepts can be abstracted, appropriate codecs could produce a sequence of words, instead of characters. It depends on the purpose of the application whether such is interesting or not. Working a bit with ebook searching algorithms lately, and one idea is to extract from the text a list of words, and represent the words with codes. Do the same for the search string. Then the search, instead of searching for characters and character strings, and skipping over punctuation, etc., it can simply search for the appropriate sequence of word codes. In this case, part of the usefulness of the abstraction is the elimination of punctuation, so it is more of an index to the character text rather an encoding of it... 
but if the encoding of the text were a sequence of extracted words, the creation of the index would then be extremely simple. I don't have applications in mind where representing sentences or novels would be particularly useful, but representing words could be extremely useful. Valid words? Given a language (or languages) and dictionary (or dictionaries), words could be flagged as valid or invalid for that dictionary. Representing invalid words could be similar to the idea of representing invalid UTF-8 bytes using the lone-surrogate error handler... possible when the application requests such. > I think that at this point in time the best we can do is claim that > Python (the language standard) uses either 16-bit code units or 21-bit > code points in its string datatype, and that, thanks to PEP 393, > CPython 3.3 and further will always use 21-bit code points (but Jython > and IronPython may forever use their platform's native 16-bit code > unit representing string type). And then we add APIs that can be used > everywhere to look for code points (even if the string contains code > points), graphemes, or larger constructs. I'd like those APIs to be > designed using a garbage-in-garbage-out principle, where if the input > conforms to some Unicode requirement, the output does too, but if the > input doesn't, the output does what makes most sense. Validation is > then limited to codecs, and optional calls. So limiting the code point values to 21 bits (wasting 11 bits) only serves to prevent applications from using those 11 bits when they have extra-Unicode values to represent. There is no shortage of 32-bit datatypes to draw from, though, but it seems an unnecessary constraint if exact conformance to Unicode is not provided... conforming codecs wouldn't create such values on input nor accept them on output, so the constraint only serves to restrict applications from using all 32 bits of the underlying storage.
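Going back to the word-coding idea above: sketched naively it is only a few lines (the names and the crude \w+ tokenizer here are purely for illustration):

    import re

    def encode_words(text, codes):
        # assign a small integer code to each distinct word, ignoring punctuation
        return [codes.setdefault(w, len(codes)) for w in re.findall(r"\w+", text.lower())]

    codes = {}
    book = encode_words("The cat sat on the mat. The mat was flat.", codes)
    query = encode_words("the mat", codes)
    # search for the query as a contiguous run of word codes
    hits = [i for i in range(len(book) - len(query) + 1) if book[i:i + len(query)] == query]
    print(hits)                    # [4, 6] -- both places "the mat" occurs, punctuation already gone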
In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> <4E5CA1F0.2070005@v.loewis.de> <20110830193806.0d718a56@pitrou.net> Message-ID: 2011/8/31 stefan brunthaler > > I think that you must deal with big endianess because some RISC can't > handle > > at all data in little endian format. > > > > In WPython I have wrote some macros which handle both endianess, but > lacking > > big endian machines I never had the opportunity to verify if something > was > > wrong. > > > I am sorry for the temporal lapse of not getting back to this directly > yesterday, we were just heading out for lunch and I figured it only > out then but immediately forgot it on our way back to the lab... > > So, as I have already said, I evaluated my optimizations on x86 > (little-endian) and PowerPC 970 (big-endian) and I did not have to > change any of my instruction decoding during interpretation. (The only > nasty bug I still remember vividly was that while on gcc for x86 the > data type char defaults to signed, whereas it defaults to unsigned on > PowerPC's gcc.) When I have time and access to a PowerPC machine again > (an ARM might be interesting, too), I will take a look at the > generated assembly code to figure out why this is working. (I have > some ideas why it might work without changing the code.) > > If I run into any problems, I'll gladly contact you :) > > BTW: AFAIR, we emailed last year regarding wpython and IIRC your > optimizations could primarily be summarized as clever > superinstructions. I have not implemented anything in that area at all > (and have in fact not even touched the compiler and its peephole > optimizer), but if parts my implementation gets in, I am sure that you > could add some of your work on top of that, too. > > Cheers, > --stefan > You're right. I took a look at our old e-mails, and I found more details about your work. It's definitely not affected by processor endianess, so you don't need any check: it just works, because you'll produce the new opcodes in memory, and consume them in memory as well. Looking at your examples, I think that WPython wordcodes usage can be useful only for the most simple ones. That's because superinstructions group together several actions that need to be splitted again to simpler ones by a tracing-JIT/compiler like your, if you want to keep it simple. You said that you added about 400 specialized instructions last year with the usual bytecodes, but wordcodes will require quite more (this can compromise performance on CPU with small data caches). So I think that it'll be better to finish your work, with all tests passed, before thinking about adding something on top (that, for me, sounds like a machine code JIT O:-) Regards, Cesare -------------- next part -------------- An HTML attachment was scrubbed... URL: From cesare.di.mauro at gmail.com Wed Aug 31 22:18:08 2011 From: cesare.di.mauro at gmail.com (Cesare Di Mauro) Date: Wed, 31 Aug 2011 22:18:08 +0200 Subject: [Python-Dev] Python 3 optimizations continued... In-Reply-To: References: <20110829231420.20c3516a@pitrou.net> <20110830025510.638b41d9@pitrou.net> Message-ID: 2011/8/31 Guido van Rossum > On Tue, Aug 30, 2011 at 10:04 PM, Cesare Di Mauro > wrote: > > It isn't, because motivation to do something new with CPython vanishes, > at > > least on some areas (virtual machine / ceval.c), even having some ideas > to > > experiment with. That's why in my last talk on EuroPython I decided to > move > > on other areas (Python objects). 
> > Cesare, I'm really sorry that you became so disillusioned that you > abandoned wordcode. I agree that we were too optimistic about Unladen > Swallow. Also that the existence of PyPy and its PR machine (:-) > should not stop us from improving CPython. > I never stopped thinking about new optimization. A lot can be made on CPython, even without resorting to something like JIT et all. > > I'm wondering if, with your experience in creating WPython, you could > review Stefan Brunthaler's code and approach (once he's put it up for > review) and possibly the two of you could even work on a joint > project? > > -- > --Guido van Rossum (python.org/~guido) > Yes, I can. I'll wait for Stefan to update its source (reaching Python 3.2 at least) as he has intended to do, and that everything is published, in order to review the code. I also agree with you that right now it doesn't need to look as state-of-the-art. First make it work, then make it nicer. ;) Regards, Cesare -------------- next part -------------- An HTML attachment was scrubbed... URL: