From paul at colomiets.name Tue Oct 1 21:17:11 2013 From: paul at colomiets.name (Paul Colomiets) Date: Tue, 1 Oct 2013 22:17:11 +0300 Subject: [Python-ideas] pprint in displayhook In-Reply-To: References: <0BFAAF4A-F5C8-48FA-9C82-1B60D164A033@gmail.com> Message-ID: Hi, On Sun, Sep 29, 2013 at 11:38 PM, Serhiy Storchaka wrote: > > What should be changed in pprint? > Would be nice if it support custom types. Just my 2 cents -- Paul From robert.kern at gmail.com Tue Oct 1 22:00:27 2013 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 01 Oct 2013 21:00:27 +0100 Subject: [Python-ideas] pprint in displayhook In-Reply-To: References: <0BFAAF4A-F5C8-48FA-9C82-1B60D164A033@gmail.com> Message-ID: On 2013-10-01 20:17, Paul Colomiets wrote: > Hi, > > On Sun, Sep 29, 2013 at 11:38 PM, Serhiy Storchaka wrote: >> >> What should be changed in pprint? > > Would be nice if it support custom types. For what it's worth, I would like to point out that IPython uses an adaptation of Armin Ronacher's pretty.py for pretty-printing as the default displayhook. It is a nice design that supports custom types after-the-fact. https://github.com/ipython/ipython/blob/master/IPython/lib/pretty.py Armin's original code: http://dev.pocoo.org/hg/sandbox/file/tip/pretty/pretty.py -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From ncoghlan at gmail.com Wed Oct 2 01:20:34 2013 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 2 Oct 2013 09:20:34 +1000 Subject: [Python-ideas] pprint in displayhook In-Reply-To: References: <0BFAAF4A-F5C8-48FA-9C82-1B60D164A033@gmail.com> Message-ID: On 2 Oct 2013 05:45, "Paul Colomiets" wrote: > > Hi, > > On Sun, Sep 29, 2013 at 11:38 PM, Serhiy Storchaka wrote: > > > > What should be changed in pprint? > > > > Would be nice if it support custom types. Fixing pprint to allow customisation was a key part of the rationale for functools.singledispatch. I guess Lukasz just hasn't had time to work on the follow-up patch to refactor the pprint module (or else I just missed it on the tracker, which is entirely plausible). Cheers, Nick. > > Just my 2 cents > > -- > Paul > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Wed Oct 2 02:56:45 2013 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 2 Oct 2013 10:56:45 +1000 Subject: [Python-ideas] pprint in displayhook In-Reply-To: References: <0BFAAF4A-F5C8-48FA-9C82-1B60D164A033@gmail.com> Message-ID: <20131002005644.GI7989@ando> On Sun, Sep 29, 2013 at 11:38:30PM +0300, Serhiy Storchaka wrote: > 28.09.13 07:17, Raymond Hettinger ???????(??): > >This might be a reasonable idea if pprint were in better shape. > >I think substantial work needs to be done on it, before it would > >be worthy of becoming the default method of display. > > What should be changed in pprint? I would like to see pprint be smarter about printing lists and dicts. At the moment, a long list is either printed all on one line, like the default display, or one item per line. This can end up as one long, narrow column, which is worse than the default. I'd like to see it be smarter about using multiple columns. E.g. pprint([1, 2, 3, ... 1000]) rather than this: [1, 2, 3, ... 
998, 999, 1000] something like this: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000] -- Steven From robert.kern at gmail.com Wed Oct 2 17:31:58 2013 From: robert.kern at gmail.com (Robert Kern) Date: Wed, 02 Oct 2013 16:31:58 +0100 Subject: [Python-ideas] pprint in displayhook In-Reply-To: <20131002005644.GI7989@ando> References: <0BFAAF4A-F5C8-48FA-9C82-1B60D164A033@gmail.com> <20131002005644.GI7989@ando> Message-ID: On 2013-10-02 01:56, Steven D'Aprano wrote: > On Sun, Sep 29, 2013 at 11:38:30PM +0300, Serhiy Storchaka wrote: >> 28.09.13 07:17, Raymond Hettinger ???????(??): >>> This might be a reasonable idea if pprint were in better shape. >>> I think substantial work needs to be done on it, before it would >>> be worthy of becoming the default method of display. >> >> What should be changed in pprint? > > I would like to see pprint be smarter about printing lists and dicts. At > the moment, a long list is either printed all on one line, like the > default display, or one item per line. This can end up as one long, > narrow column, which is worse than the default. I'd like to see it be > smarter about using multiple columns. > > E.g. pprint([1, 2, 3, ... 1000]) > > rather than this: > > [1, > 2, > 3, > ... > 998, > 999, > 1000] > > something like this: > > [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, > ... > 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000] As someone who has used pretty-printing as their default displayhook for a decade now via IPython, I have to say that this case happens much less often than one might expect. It *is* irritating the rare times it does come up, but less so than what I expect we would see from the false positives of a more intelligent algorithm. But I withhold final judgement until I see the actual results of such an algorithm. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From paul at colomiets.name Wed Oct 2 22:20:47 2013 From: paul at colomiets.name (Paul Colomiets) Date: Wed, 2 Oct 2013 23:20:47 +0300 Subject: [Python-ideas] pprint in displayhook In-Reply-To: References: <0BFAAF4A-F5C8-48FA-9C82-1B60D164A033@gmail.com> Message-ID: Hi, On Wed, Oct 2, 2013 at 2:20 AM, Nick Coghlan wrote: > Fixing pprint to allow customisation was a key part of the rationale for > functools.singledispatch. I guess Lukasz just hasn't had time to work on the > follow-up patch to refactor the pprint module (or else I just missed it on > the tracker, which is entirely plausible). > Nice. Any chances it will be in time for python 3.4? We are waiting for it for about a decade :) -- Paul From g.rodola at gmail.com Thu Oct 3 19:09:51 2013 From: g.rodola at gmail.com (Giampaolo Rodola') Date: Thu, 3 Oct 2013 19:09:51 +0200 Subject: [Python-ideas] Allow from foo import bar* Message-ID: I suppose this has already been proposed in past but couldn't find any online reference so here goes. When it comes to module constant imports I usually like being explicit it's OK with me as long as I have to do: >>> from resource import (RLIMIT_CORE, RLIMIT_CPU, RLIMIT_FSIZE) Nevertheless in case the existence of certain constants depends on the platform in use I end up doing: >>> if hasattr(resource, "RLIMIT_MSGQUEUE"): # linux only .... import resource.RLIMIT_MSGQUEUE .... >>> if hasattr(resource, "RLIMIT_NICE"): # linux only .... import resource.RLIMIT_NICE .... 
...or worse, if for simplicity I'm willing to simply import all RLIMIT_* constants I'll have to do this: >>> import resource >>> import sys >>> for name in dir(resource): .... if name.startswith('RLIMIT_'): .... setattr(sys.modules[__name__], name, getattr(resource, name)) ...or just give up and use: from resource import * ...which of course will pollute the namespace with unnecessary stuff. So why not just allow "from resource import RLIMIT_*" syntax? Another interesting variation might be: >>> from socket import AF_*, SOCK_* >>> AF_INET, AF_INET6, SOCK_STREAM, SOCK_DGRAM (2, 10, 1, 2) On the other hand mixing "*" and "common" imports would be forbidden: >>> from socket import AF_*, socket, File "", line 1 from socket import AF_*, socket ^ SyntaxError: invalid syntax; Thoughts? --- Giampaolo https://code.google.com/p/pyftpdlib/ https://code.google.com/p/psutil/ https://code.google.com/p/pysendfile/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Thu Oct 3 19:16:37 2013 From: guido at python.org (Guido van Rossum) Date: Thu, 3 Oct 2013 10:16:37 -0700 Subject: [Python-ideas] Allow from foo import bar* In-Reply-To: References: Message-ID: Hm. Why not just use "import socket" and then use "socket.AF_"? On Thu, Oct 3, 2013 at 10:09 AM, Giampaolo Rodola' wrote: > I suppose this has already been proposed in past but couldn't find > any online reference so here goes. > When it comes to module constant imports I usually like being explicit > it's OK with me as long as I have to do: > > >>> from resource import (RLIMIT_CORE, RLIMIT_CPU, RLIMIT_FSIZE) > > Nevertheless in case the existence of certain constants depends on the > platform in use I end up doing: > > >>> if hasattr(resource, "RLIMIT_MSGQUEUE"): # linux only > .... import resource.RLIMIT_MSGQUEUE > .... > >>> if hasattr(resource, "RLIMIT_NICE"): # linux only > .... import resource.RLIMIT_NICE > .... > > > ...or worse, if for simplicity I'm willing to simply import all RLIMIT_* > constants I'll have to do this: > > >>> import resource > >>> import sys > >>> for name in dir(resource): > .... if name.startswith('RLIMIT_'): > .... setattr(sys.modules[__name__], name, getattr(resource, name)) > > ...or just give up and use: > > from resource import * > > ...which of course will pollute the namespace with unnecessary stuff. > So why not just allow "from resource import RLIMIT_*" syntax? > Another interesting variation might be: > > > >>> from socket import AF_*, SOCK_* > >>> AF_INET, AF_INET6, SOCK_STREAM, SOCK_DGRAM > (2, 10, 1, 2) > > > On the other hand mixing "*" and "common" imports would be forbidden: > > >>> from socket import AF_*, socket, > File "", line 1 > from socket import AF_*, socket > ^ > SyntaxError: invalid syntax; > > > Thoughts? > > > --- Giampaolo > https://code.google.com/p/pyftpdlib/ > https://code.google.com/p/psutil/ > https://code.google.com/p/pysendfile/ > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From python at mrabarnett.plus.com Thu Oct 3 19:25:46 2013 From: python at mrabarnett.plus.com (MRAB) Date: Thu, 03 Oct 2013 18:25:46 +0100 Subject: [Python-ideas] Allow from foo import bar* In-Reply-To: References: Message-ID: <524DA89A.6090608@mrabarnett.plus.com> On 03/10/2013 18:09, Giampaolo Rodola' wrote: > I suppose this has already been proposed in past but couldn't find > any online reference so here goes. > When it comes to module constant imports I usually like being explicit > it's OK with me as long as I have to do: > > >>> from resource import (RLIMIT_CORE, RLIMIT_CPU, RLIMIT_FSIZE) > > Nevertheless in case the existence of certain constants depends on the > platform in use I end up doing: > > >>> if hasattr(resource, "RLIMIT_MSGQUEUE"): # linux only > .... import resource.RLIMIT_MSGQUEUE > .... > >>> if hasattr(resource, "RLIMIT_NICE"): # linux only > .... import resource.RLIMIT_NICE > .... > > > ...or worse, if for simplicity I'm willing to simply import all RLIMIT_* > constants I'll have to do this: > > >>> import resource > >>> import sys > >>> for name in dir(resource): > .... if name.startswith('RLIMIT_'): > .... setattr(sys.modules[__name__], name, getattr(resource, name)) > > ...or just give up and use: > > from resource import * > > ...which of course will pollute the namespace with unnecessary stuff. > So why not just allow "from resource import RLIMIT_*" syntax? > Another interesting variation might be: > > > >>> from socket import AF_*, SOCK_* > >>> AF_INET, AF_INET6, SOCK_STREAM, SOCK_DGRAM > (2, 10, 1, 2) > > > On the other hand mixing "*" and "common" imports would be forbidden: > > >>> from socket import AF_*, socket, > File "", line 1 > from socket import AF_*, socket > ^ > SyntaxError: invalid syntax; > > > Thoughts? > If you're importing RLIMIT_MSGQUEUE, then presumably you're using it somewhere(!), but if it's platform-specific, you'll still need to check which platform the code is running on anyway before trying to use it... From storchaka at gmail.com Thu Oct 3 20:42:03 2013 From: storchaka at gmail.com (Serhiy Storchaka) Date: Thu, 03 Oct 2013 21:42:03 +0300 Subject: [Python-ideas] Allow from foo import bar* In-Reply-To: References: Message-ID: 03.10.13 20:09, Giampaolo Rodola' ???????(??): > Another interesting variation might be: > > >>> from socket import AF_*, SOCK_* > >>> AF_INET, AF_INET6, SOCK_STREAM, SOCK_DGRAM > (2, 10, 1, 2) >>> from socket import AddressFamily, SocketType >>> globals().update(AddressFamily.__members__) >>> globals().update(SocketType.__members__) >>> AF_INET, AF_INET6, SOCK_STREAM, SOCK_DGRAM (, , , ) From g.rodola at gmail.com Thu Oct 3 20:43:09 2013 From: g.rodola at gmail.com (Giampaolo Rodola') Date: Thu, 3 Oct 2013 20:43:09 +0200 Subject: [Python-ideas] Allow from foo import bar* In-Reply-To: References: Message-ID: > Hm. Why not just use "import socket" and then use "socket.AF_"? That's what I usually do as well (because explicit is better than implicit) but from my understanding when it comes to constants it is generally not considered a bad practice to import them directly into the module namespace. I guess my specific case is bit different though. I have all these constants defined in a _linux.py submodule which I import from __init__.py in order to expose them publicly. 
And this is how I do that: # Linux >= 2.6.36 if _psplatform.HAS_PRLIMIT: from psutil._pslinux import (RLIM_INFINITY, RLIMIT_AS, RLIMIT_CORE, RLIMIT_CPU, RLIMIT_DATA, RLIMIT_FSIZE, RLIMIT_LOCKS, RLIMIT_MEMLOCK, RLIMIT_NOFILE, RLIMIT_NPROC, RLIMIT_RSS, RLIMIT_STACK) if hasattr(_psplatform, "RLIMIT_MSGQUEUE"): RLIMIT_MSGQUEUE = _psplatform.RLIMIT_MSGQUEUE if hasattr(_psplatform, "RLIMIT_NICE"): RLIMIT_NICE = _psplatform.RLIMIT_NICE if hasattr(_psplatform, "RLIMIT_RTPRIO"): RLIMIT_RTPRIO = _psplatform.RLIMIT_RTPRIO if hasattr(_psplatform, "RLIMIT_RTTIME"): RLIMIT_RTTIME = _psplatform.RLIMIT_RTTIME if hasattr(_psplatform, "RLIMIT_SIGPENDING"): RLIMIT_SIGPENDING = _psplatform.RLIMIT_SIGPENDING In *this specific case* a "from _psplatform import RLIM*" would have solved my problem nicely. On one hand this might look like encouraging wildcard import usage, but I think it's the opposite. Sometimes people use "from foo import *" just because "from foo import bar*" is not available. --- Giampaolo https://code.google.com/p/pyftpdlib/ https://code.google.com/p/psutil/ https://code.google.com/p/pysendfile/ On Thu, Oct 3, 2013 at 7:16 PM, Guido van Rossum wrote: > Hm. Why not just use "import socket" and then use "socket.AF_"? > > > On Thu, Oct 3, 2013 at 10:09 AM, Giampaolo Rodola' wrote: > >> I suppose this has already been proposed in past but couldn't find >> any online reference so here goes. >> When it comes to module constant imports I usually like being explicit >> it's OK with me as long as I have to do: >> >> >>> from resource import (RLIMIT_CORE, RLIMIT_CPU, RLIMIT_FSIZE) >> >> Nevertheless in case the existence of certain constants depends on the >> platform in use I end up doing: >> >> >>> if hasattr(resource, "RLIMIT_MSGQUEUE"): # linux only >> .... import resource.RLIMIT_MSGQUEUE >> .... >> >>> if hasattr(resource, "RLIMIT_NICE"): # linux only >> .... import resource.RLIMIT_NICE >> .... >> >> >> ...or worse, if for simplicity I'm willing to simply import all RLIMIT_* >> constants I'll have to do this: >> >> >>> import resource >> >>> import sys >> >>> for name in dir(resource): >> .... if name.startswith('RLIMIT_'): >> .... setattr(sys.modules[__name__], name, getattr(resource, name)) >> >> ...or just give up and use: >> >> from resource import * >> >> ...which of course will pollute the namespace with unnecessary stuff. >> So why not just allow "from resource import RLIMIT_*" syntax? >> Another interesting variation might be: >> >> >> >>> from socket import AF_*, SOCK_* >> >>> AF_INET, AF_INET6, SOCK_STREAM, SOCK_DGRAM >> (2, 10, 1, 2) >> >> >> On the other hand mixing "*" and "common" imports would be forbidden: >> >> >>> from socket import AF_*, socket, >> File "", line 1 >> from socket import AF_*, socket >> ^ >> SyntaxError: invalid syntax; >> >> >> Thoughts? >> >> >> --- Giampaolo >> https://code.google.com/p/pyftpdlib/ >> https://code.google.com/p/psutil/ >> https://code.google.com/p/pysendfile/ >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> >> > > > -- > --Guido van Rossum (python.org/~guido) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Thu Oct 3 21:00:39 2013 From: guido at python.org (Guido van Rossum) Date: Thu, 3 Oct 2013 12:00:39 -0700 Subject: [Python-ideas] Allow from foo import bar* In-Reply-To: References: Message-ID: Hm. 
It seems a pretty small use case for what would be a major implementation challenge -- I'm sure there would be lots of issues implementing this cleanly given all the special casing for import *, and the special handling of importlib during bootstrap. On Thu, Oct 3, 2013 at 11:43 AM, Giampaolo Rodola' wrote: > > Hm. Why not just use "import socket" and then use "socket.AF_"? > > That's what I usually do as well (because explicit is better than > implicit) but from my understanding when it comes to constants it is > generally not considered a bad practice to import them directly into the > module namespace. > I guess my specific case is bit different though. > I have all these constants defined in a _linux.py submodule which I import > from __init__.py in order to expose them publicly. > And this is how I do that: > > # Linux >= 2.6.36 > if _psplatform.HAS_PRLIMIT: > from psutil._pslinux import (RLIM_INFINITY, RLIMIT_AS, RLIMIT_CORE, > RLIMIT_CPU, RLIMIT_DATA, RLIMIT_FSIZE, > RLIMIT_LOCKS, RLIMIT_MEMLOCK, > RLIMIT_NOFILE, > RLIMIT_NPROC, RLIMIT_RSS, > RLIMIT_STACK) > if hasattr(_psplatform, "RLIMIT_MSGQUEUE"): > RLIMIT_MSGQUEUE = _psplatform.RLIMIT_MSGQUEUE > if hasattr(_psplatform, "RLIMIT_NICE"): > RLIMIT_NICE = _psplatform.RLIMIT_NICE > if hasattr(_psplatform, "RLIMIT_RTPRIO"): > RLIMIT_RTPRIO = _psplatform.RLIMIT_RTPRIO > if hasattr(_psplatform, "RLIMIT_RTTIME"): > RLIMIT_RTTIME = _psplatform.RLIMIT_RTTIME > if hasattr(_psplatform, "RLIMIT_SIGPENDING"): > RLIMIT_SIGPENDING = _psplatform.RLIMIT_SIGPENDING > > > In *this specific case* a "from _psplatform import RLIM*" would have > solved my problem nicely. > On one hand this might look like encouraging wildcard import usage, but I > think it's the opposite. > Sometimes people use "from foo import *" just because "from foo import > bar*" is not available. > > > --- Giampaolo > https://code.google.com/p/pyftpdlib/ > https://code.google.com/p/psutil/ > https://code.google.com/p/pysendfile/ > > > On Thu, Oct 3, 2013 at 7:16 PM, Guido van Rossum wrote: > >> Hm. Why not just use "import socket" and then use "socket.AF_"? >> >> >> On Thu, Oct 3, 2013 at 10:09 AM, Giampaolo Rodola' wrote: >> >>> I suppose this has already been proposed in past but couldn't find >>> any online reference so here goes. >>> When it comes to module constant imports I usually like being explicit >>> it's OK with me as long as I have to do: >>> >>> >>> from resource import (RLIMIT_CORE, RLIMIT_CPU, RLIMIT_FSIZE) >>> >>> Nevertheless in case the existence of certain constants depends on the >>> platform in use I end up doing: >>> >>> >>> if hasattr(resource, "RLIMIT_MSGQUEUE"): # linux only >>> .... import resource.RLIMIT_MSGQUEUE >>> .... >>> >>> if hasattr(resource, "RLIMIT_NICE"): # linux only >>> .... import resource.RLIMIT_NICE >>> .... >>> >>> >>> ...or worse, if for simplicity I'm willing to simply import all RLIMIT_* >>> constants I'll have to do this: >>> >>> >>> import resource >>> >>> import sys >>> >>> for name in dir(resource): >>> .... if name.startswith('RLIMIT_'): >>> .... setattr(sys.modules[__name__], name, getattr(resource, name)) >>> >>> ...or just give up and use: >>> >>> from resource import * >>> >>> ...which of course will pollute the namespace with unnecessary stuff. >>> So why not just allow "from resource import RLIMIT_*" syntax? 
>>> Another interesting variation might be: >>> >>> >>> >>> from socket import AF_*, SOCK_* >>> >>> AF_INET, AF_INET6, SOCK_STREAM, SOCK_DGRAM >>> (2, 10, 1, 2) >>> >>> >>> On the other hand mixing "*" and "common" imports would be forbidden: >>> >>> >>> from socket import AF_*, socket, >>> File "", line 1 >>> from socket import AF_*, socket >>> ^ >>> SyntaxError: invalid syntax; >>> >>> >>> Thoughts? >>> >>> >>> --- Giampaolo >>> https://code.google.com/p/pyftpdlib/ >>> https://code.google.com/p/psutil/ >>> https://code.google.com/p/pysendfile/ >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> >>> >> >> >> -- >> --Guido van Rossum (python.org/~guido) >> > > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.rodola at gmail.com Thu Oct 3 21:13:34 2013 From: g.rodola at gmail.com (Giampaolo Rodola') Date: Thu, 3 Oct 2013 21:13:34 +0200 Subject: [Python-ideas] Allow from foo import bar* In-Reply-To: References: Message-ID: On Thu, Oct 3, 2013 at 9:00 PM, Guido van Rossum wrote: > Hm. It seems a pretty small use case for what would be a major > implementation challenge -- I'm sure there would be lots of issues > implementing this cleanly given all the special casing for import *, and > the special handling of importlib during bootstrap. > Fair enough. --- Giampaolo https://code.google.com/p/pyftpdlib/ https://code.google.com/p/psutil/ https://code.google.com/p/pysendfile/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From joshua at landau.ws Thu Oct 3 21:59:42 2013 From: joshua at landau.ws (Joshua Landau) Date: Thu, 3 Oct 2013 20:59:42 +0100 Subject: [Python-ideas] Allow from foo import bar* In-Reply-To: References: Message-ID: On 3 October 2013 19:43, Giampaolo Rodola' wrote: >> Hm. Why not just use "import socket" and then use "socket.AF_"? > > That's what I usually do as well (because explicit is better than implicit) > but from my understanding when it comes to constants it is generally not > considered a bad practice to import them directly into the module namespace. > I guess my specific case is bit different though. > I have all these constants defined in a _linux.py submodule which I import > from __init__.py in order to expose them publicly. > And this is how I do that: > > # Linux >= 2.6.36 > if _psplatform.HAS_PRLIMIT: > from psutil._pslinux import (RLIM_INFINITY, RLIMIT_AS, RLIMIT_CORE, > RLIMIT_CPU, RLIMIT_DATA, RLIMIT_FSIZE, > RLIMIT_LOCKS, RLIMIT_MEMLOCK, ... > > In *this specific case* a "from _psplatform import RLIM*" would have solved > my problem nicely. Or we change the module such that we can do from psutil._pslinux import RLIMIT and then use RLIMIT.CORE, RLIMIT.CPU, RLIMIT.LOCKS, etc. From rymg19 at gmail.com Thu Oct 3 22:43:01 2013 From: rymg19 at gmail.com (Ryan Gonzalez) Date: Thu, 3 Oct 2013 15:43:01 -0500 Subject: [Python-ideas] Allow from foo import bar* In-Reply-To: References: Message-ID: Well, that looks painful! I agree with Joshua: If you are doing something like that, namespaces work best. If you're really that desperate, why not something like this: globals().update({name: getattr(_psplatform, name) for name in dir(_psplatform) if name.startswith('RLIMIT')}) On Thu, Oct 3, 2013 at 1:43 PM, Giampaolo Rodola' wrote: > > Hm. Why not just use "import socket" and then use "socket.AF_"? 
> > That's what I usually do as well (because explicit is better than > implicit) but from my understanding when it comes to constants it is > generally not considered a bad practice to import them directly into the > module namespace. > I guess my specific case is bit different though. > I have all these constants defined in a _linux.py submodule which I import > from __init__.py in order to expose them publicly. > And this is how I do that: > > # Linux >= 2.6.36 > if _psplatform.HAS_PRLIMIT: > from psutil._pslinux import (RLIM_INFINITY, RLIMIT_AS, RLIMIT_CORE, > RLIMIT_CPU, RLIMIT_DATA, RLIMIT_FSIZE, > RLIMIT_LOCKS, RLIMIT_MEMLOCK, > RLIMIT_NOFILE, > RLIMIT_NPROC, RLIMIT_RSS, > RLIMIT_STACK) > if hasattr(_psplatform, "RLIMIT_MSGQUEUE"): > RLIMIT_MSGQUEUE = _psplatform.RLIMIT_MSGQUEUE > if hasattr(_psplatform, "RLIMIT_NICE"): > RLIMIT_NICE = _psplatform.RLIMIT_NICE > if hasattr(_psplatform, "RLIMIT_RTPRIO"): > RLIMIT_RTPRIO = _psplatform.RLIMIT_RTPRIO > if hasattr(_psplatform, "RLIMIT_RTTIME"): > RLIMIT_RTTIME = _psplatform.RLIMIT_RTTIME > if hasattr(_psplatform, "RLIMIT_SIGPENDING"): > RLIMIT_SIGPENDING = _psplatform.RLIMIT_SIGPENDING > > > In *this specific case* a "from _psplatform import RLIM*" would have > solved my problem nicely. > On one hand this might look like encouraging wildcard import usage, but I > think it's the opposite. > Sometimes people use "from foo import *" just because "from foo import > bar*" is not available. > > > --- Giampaolo > https://code.google.com/p/pyftpdlib/ > https://code.google.com/p/psutil/ > https://code.google.com/p/pysendfile/ > > > On Thu, Oct 3, 2013 at 7:16 PM, Guido van Rossum wrote: > >> Hm. Why not just use "import socket" and then use "socket.AF_"? >> >> >> On Thu, Oct 3, 2013 at 10:09 AM, Giampaolo Rodola' wrote: >> >>> I suppose this has already been proposed in past but couldn't find >>> any online reference so here goes. >>> When it comes to module constant imports I usually like being explicit >>> it's OK with me as long as I have to do: >>> >>> >>> from resource import (RLIMIT_CORE, RLIMIT_CPU, RLIMIT_FSIZE) >>> >>> Nevertheless in case the existence of certain constants depends on the >>> platform in use I end up doing: >>> >>> >>> if hasattr(resource, "RLIMIT_MSGQUEUE"): # linux only >>> .... import resource.RLIMIT_MSGQUEUE >>> .... >>> >>> if hasattr(resource, "RLIMIT_NICE"): # linux only >>> .... import resource.RLIMIT_NICE >>> .... >>> >>> >>> ...or worse, if for simplicity I'm willing to simply import all RLIMIT_* >>> constants I'll have to do this: >>> >>> >>> import resource >>> >>> import sys >>> >>> for name in dir(resource): >>> .... if name.startswith('RLIMIT_'): >>> .... setattr(sys.modules[__name__], name, getattr(resource, name)) >>> >>> ...or just give up and use: >>> >>> from resource import * >>> >>> ...which of course will pollute the namespace with unnecessary stuff. >>> So why not just allow "from resource import RLIMIT_*" syntax? >>> Another interesting variation might be: >>> >>> >>> >>> from socket import AF_*, SOCK_* >>> >>> AF_INET, AF_INET6, SOCK_STREAM, SOCK_DGRAM >>> (2, 10, 1, 2) >>> >>> >>> On the other hand mixing "*" and "common" imports would be forbidden: >>> >>> >>> from socket import AF_*, socket, >>> File "", line 1 >>> from socket import AF_*, socket >>> ^ >>> SyntaxError: invalid syntax; >>> >>> >>> Thoughts? 
>>> >>> >>> --- Giampaolo >>> https://code.google.com/p/pyftpdlib/ >>> https://code.google.com/p/psutil/ >>> https://code.google.com/p/pysendfile/ >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> >>> >> >> >> -- >> --Guido van Rossum (python.org/~guido) >> > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- Ryan -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Thu Oct 3 23:44:43 2013 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 4 Oct 2013 07:44:43 +1000 Subject: [Python-ideas] Allow from foo import bar* In-Reply-To: References: Message-ID: On 4 Oct 2013 06:01, "Joshua Landau" wrote: > > On 3 October 2013 19:43, Giampaolo Rodola' wrote: > >> Hm. Why not just use "import socket" and then use "socket.AF_"? > > > > That's what I usually do as well (because explicit is better than implicit) > > but from my understanding when it comes to constants it is generally not > > considered a bad practice to import them directly into the module namespace. > > I guess my specific case is bit different though. > > I have all these constants defined in a _linux.py submodule which I import > > from __init__.py in order to expose them publicly. > > And this is how I do that: > > > > # Linux >= 2.6.36 > > if _psplatform.HAS_PRLIMIT: > > from psutil._pslinux import (RLIM_INFINITY, RLIMIT_AS, RLIMIT_CORE, > > RLIMIT_CPU, RLIMIT_DATA, RLIMIT_FSIZE, > > RLIMIT_LOCKS, RLIMIT_MEMLOCK, > ... > > > > In *this specific case* a "from _psplatform import RLIM*" would have solved > > my problem nicely. > > Or we change the module such that we can do > > from psutil._pslinux import RLIMIT > > and then use RLIMIT.CORE, RLIMIT.CPU, RLIMIT.LOCKS, etc. Another Enum candidate, perhaps? Cheers, Nick. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas -------------- next part -------------- An HTML attachment was scrubbed... URL: From storchaka at gmail.com Fri Oct 4 21:17:14 2013 From: storchaka at gmail.com (Serhiy Storchaka) Date: Fri, 04 Oct 2013 22:17:14 +0300 Subject: [Python-ideas] pprint in displayhook In-Reply-To: <20131002005644.GI7989@ando> References: <0BFAAF4A-F5C8-48FA-9C82-1B60D164A033@gmail.com> <20131002005644.GI7989@ando> Message-ID: 02.10.13 03:56, Steven D'Aprano ???????(??): > I would like to see pprint be smarter about printing lists and dicts. At > the moment, a long list is either printed all on one line, like the > default display, or one item per line. This can end up as one long, > narrow column, which is worse than the default. I'd like to see it be > smarter about using multiple columns. http://bugs.python.org/issue19132 From storchaka at gmail.com Tue Oct 8 13:17:59 2013 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 08 Oct 2013 14:17:59 +0300 Subject: [Python-ideas] Add "has_surrogates" flags to string object Message-ID: Here is an idea about adding a mark to PyUnicode object which allows fast answer to the question if a string has surrogate code. This mark has one of three possible states: * String doesn't contain surrogates. * String contains surrogates. * It is still unknown. 
We can combine this with "is_ascii" flag in 2-bit value: * String is ASCII-only (and doesn't contain surrogates). * String is not ASCII-only and doesn't contain surrogates. * String is not ASCII-only and contains surrogates. * String is not ASCII-only and it is still unknown if it contains surrogate. By default a string is created in "unknown" state (if it is UCS2 or UCS4). After first request it can be switched to "has surrogates" or "hasn't surrogates". State of the result of concatenating or slicing can be determined from states of input strings. This will allow faster UTF-16 and UTF-32 encoding (and perhaps even a little faster UTF-8 encoding) and converting to wchar_t* if string hasn't surrogates (this is true in most cases). From masklinn at masklinn.net Tue Oct 8 13:38:19 2013 From: masklinn at masklinn.net (Masklinn) Date: Tue, 8 Oct 2013 13:38:19 +0200 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: Message-ID: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> On 2013-10-08, at 13:17 , Serhiy Storchaka wrote: > Here is an idea about adding a mark to PyUnicode object which allows fast answer to the question if a string has surrogate code. This mark has one of three possible states: > > * String doesn't contain surrogates. > * String contains surrogates. > * It is still unknown. > > We can combine this with "is_ascii" flag in 2-bit value: > > * String is ASCII-only (and doesn't contain surrogates). > * String is not ASCII-only and doesn't contain surrogates. > * String is not ASCII-only and contains surrogates. > * String is not ASCII-only and it is still unknown if it contains surrogate. Isn't that redundant with the kind under shortest form representation? From solipsis at pitrou.net Tue Oct 8 13:43:43 2013 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 8 Oct 2013 13:43:43 +0200 Subject: [Python-ideas] Add "has_surrogates" flags to string object References: Message-ID: <20131008134343.2e084051@pitrou.net> Le Tue, 08 Oct 2013 14:17:59 +0300, Serhiy Storchaka a ?crit : > Here is an idea about adding a mark to PyUnicode object which allows > fast answer to the question if a string has surrogate code. This mark > has one of three possible states: > > * String doesn't contain surrogates. > * String contains surrogates. > * It is still unknown. > > We can combine this with "is_ascii" flag in 2-bit value: > > * String is ASCII-only (and doesn't contain surrogates). > * String is not ASCII-only and doesn't contain surrogates. > * String is not ASCII-only and contains surrogates. > * String is not ASCII-only and it is still unknown if it contains > surrogate. > > By default a string is created in "unknown" state (if it is UCS2 or > UCS4). After first request it can be switched to "has surrogates" or > "hasn't surrogates". State of the result of concatenating or slicing > can be determined from states of input strings. Not true for slicing (you can take a non-surrogates slice of a surrogates string). Other than that, this sounds reasonable to me, provided that the patch isn't too complex and the perf improvements are worth it. Regards Antoine. 
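To make the propagation rules concrete, here is a minimal pure-Python sketch of the proposed tri-state flag, following the rules discussed above: concatenation preserves both definite states, while slicing (per Antoine's remark) can only preserve the "no surrogates" state. The names (SurrogateState, scan, concat_state, slice_state) are hypothetical illustrations; the actual proposal concerns a private bit on the C-level PyUnicode object, not a Python-level API.

    from enum import Enum

    class SurrogateState(Enum):
        NO = 0        # known to contain no surrogate code points
        YES = 1       # known to contain at least one surrogate code point
        UNKNOWN = 2   # not determined yet (the default for UCS2/UCS4 strings)

    def scan(s):
        # The lazy fallback: scan once, then the result can be cached.
        return (SurrogateState.YES
                if any(0xD800 <= ord(c) <= 0xDFFF for c in s)
                else SurrogateState.NO)

    def concat_state(a, b):
        # a + b has surrogates if either operand does, and is surrogate-free
        # only if both operands are known to be surrogate-free.
        if SurrogateState.YES in (a, b):
            return SurrogateState.YES
        if a is SurrogateState.NO and b is SurrogateState.NO:
            return SurrogateState.NO
        return SurrogateState.UNKNOWN

    def slice_state(state):
        # A slice of a surrogate-free string is surrogate-free; a slice of a
        # string that has surrogates may or may not contain one.
        return state if state is SurrogateState.NO else SurrogateState.UNKNOWN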
From storchaka at gmail.com Tue Oct 8 13:43:51 2013 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 08 Oct 2013 14:43:51 +0300 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> Message-ID: 08.10.13 14:38, Masklinn ???????(??): > On 2013-10-08, at 13:17 , Serhiy Storchaka wrote: > >> Here is an idea about adding a mark to PyUnicode object which allows fast answer to the question if a string has surrogate code. This mark has one of three possible states: >> >> * String doesn't contain surrogates. >> * String contains surrogates. >> * It is still unknown. >> >> We can combine this with "is_ascii" flag in 2-bit value: >> >> * String is ASCII-only (and doesn't contain surrogates). >> * String is not ASCII-only and doesn't contain surrogates. >> * String is not ASCII-only and contains surrogates. >> * String is not ASCII-only and it is still unknown if it contains surrogate. > > Isn't that redundant with the kind under shortest form representation? No, it isn't redundant. '\udc80' is UCS2 string with surrogate code, and '\udc80\U00010000' is UCS4 string with surrogate code. UCS2 string without surrogate codes can be encoded in UTF-16 by memcpy(). From mal at egenix.com Tue Oct 8 13:58:00 2013 From: mal at egenix.com (M.-A. Lemburg) Date: Tue, 08 Oct 2013 13:58:00 +0200 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: Message-ID: <5253F348.3010204@egenix.com> On 08.10.2013 13:17, Serhiy Storchaka wrote: > Here is an idea about adding a mark to PyUnicode object which allows fast answer to the question if > a string has surrogate code. This mark has one of three possible states: > > * String doesn't contain surrogates. > * String contains surrogates. > * It is still unknown. > > We can combine this with "is_ascii" flag in 2-bit value: > > * String is ASCII-only (and doesn't contain surrogates). > * String is not ASCII-only and doesn't contain surrogates. > * String is not ASCII-only and contains surrogates. > * String is not ASCII-only and it is still unknown if it contains surrogate. > > By default a string is created in "unknown" state (if it is UCS2 or UCS4). After first request it > can be switched to "has surrogates" or "hasn't surrogates". State of the result of concatenating or > slicing can be determined from states of input strings. > > This will allow faster UTF-16 and UTF-32 encoding (and perhaps even a little faster UTF-8 encoding) > and converting to wchar_t* if string hasn't surrogates (this is true in most cases). 
I guess you could use one bit from the kind structure for that: /* Character size: - PyUnicode_WCHAR_KIND (0): * character type = wchar_t (16 or 32 bits, depending on the platform) - PyUnicode_1BYTE_KIND (1): * character type = Py_UCS1 (8 bits, unsigned) * all characters are in the range U+0000-U+00FF (latin1) * if ascii is set, all characters are in the range U+0000-U+007F (ASCII), otherwise at least one character is in the range U+0080-U+00FF - PyUnicode_2BYTE_KIND (2): * character type = Py_UCS2 (16 bits, unsigned) * all characters are in the range U+0000-U+FFFF (BMP) * at least one character is in the range U+0100-U+FFFF - PyUnicode_4BYTE_KIND (4): * character type = Py_UCS4 (32 bits, unsigned) * all characters are in the range U+0000-U+10FFFF * at least one character is in the range U+10000-U+10FFFF */ unsigned int kind:3; For some reason, it allocates 3 bits, but only 2 bits are used. The again, the state struct is unsigned int, so there's still plenty of room for extra flags. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 08 2013) >>> Python Projects, Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2013-10-14: PyCon DE 2013, Cologne, Germany ... 6 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From masklinn at masklinn.net Tue Oct 8 13:58:20 2013 From: masklinn at masklinn.net (Masklinn) Date: Tue, 8 Oct 2013 13:58:20 +0200 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> Message-ID: On 2013-10-08, at 13:43 , Serhiy Storchaka wrote: > 08.10.13 14:38, Masklinn ???????(??): >> On 2013-10-08, at 13:17 , Serhiy Storchaka wrote: >> >>> Here is an idea about adding a mark to PyUnicode object which allows fast answer to the question if a string has surrogate code. This mark has one of three possible states: >>> >>> * String doesn't contain surrogates. >>> * String contains surrogates. >>> * It is still unknown. >>> >>> We can combine this with "is_ascii" flag in 2-bit value: >>> >>> * String is ASCII-only (and doesn't contain surrogates). >>> * String is not ASCII-only and doesn't contain surrogates. >>> * String is not ASCII-only and contains surrogates. >>> * String is not ASCII-only and it is still unknown if it contains surrogate. >> >> Isn't that redundant with the kind under shortest form representation? > > No, it isn't redundant. '\udc80' is UCS2 string with surrogate code, and '\udc80\U00010000' is UCS4 string with surrogate code. I don't know the details of the flexible string representation, but I believed the names fit what was actually in memory. UCS2 does not have surrogate pairs, thus surrogate codes make no sense in UCS2, they're a UTF-16 concept. Likewise for UCS4. Surrogate codes are not codepoints, they have no reason to appear in either UCS2 or UCS4 outside of encoding errors. > UCS2 string without surrogate codes can be encoded in UTF-16 by memcpy(). 
Surrogate codes prevent that (modulo objections above) for slicing (not that it's a big issue I think, a guard can just check whether it's slicing within a surrogate pair, that only requires checking the first and last 2 bytes of the range) but not for concatenation right? From victor.stinner at gmail.com Tue Oct 8 14:23:09 2013 From: victor.stinner at gmail.com (Victor Stinner) Date: Tue, 8 Oct 2013 14:23:09 +0200 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: Message-ID: I like the idea. I prefer to add another flag (1 bit), instead of having a complex with 4 different values. Your idea looks specific to the PEP 393, so I prefer to keep the flag private. Otherwise it would be hard for other implementations of Python to implement the function getting the flag value. Victor 2013/10/8 Serhiy Storchaka : > Here is an idea about adding a mark to PyUnicode object which allows fast > answer to the question if a string has surrogate code. This mark has one of > three possible states: > > * String doesn't contain surrogates. > * String contains surrogates. > * It is still unknown. > > We can combine this with "is_ascii" flag in 2-bit value: > > * String is ASCII-only (and doesn't contain surrogates). > * String is not ASCII-only and doesn't contain surrogates. > * String is not ASCII-only and contains surrogates. > * String is not ASCII-only and it is still unknown if it contains surrogate. > > By default a string is created in "unknown" state (if it is UCS2 or UCS4). > After first request it can be switched to "has surrogates" or "hasn't > surrogates". State of the result of concatenating or slicing can be > determined from states of input strings. > > This will allow faster UTF-16 and UTF-32 encoding (and perhaps even a little > faster UTF-8 encoding) and converting to wchar_t* if string hasn't > surrogates (this is true in most cases). From steve at pearwood.info Tue Oct 8 15:02:08 2013 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 9 Oct 2013 00:02:08 +1100 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> Message-ID: <20131008130208.GX7989@ando> On Tue, Oct 08, 2013 at 01:58:20PM +0200, Masklinn wrote: > > On 2013-10-08, at 13:43 , Serhiy Storchaka wrote: > > > 08.10.13 14:38, Masklinn ???????(??): > >> On 2013-10-08, at 13:17 , Serhiy Storchaka wrote: > >> > >>> Here is an idea about adding a mark to PyUnicode object which > >>> allows fast answer to the question if a string has surrogate code. > >>> This mark has one of three possible states: [...] > >> Isn't that redundant with the kind under shortest form representation? > > > > No, it isn't redundant. '\udc80' is UCS2 string with surrogate code, and '\udc80\U00010000' is UCS4 string with surrogate code. > > I don't know the details of the flexible string representation, but I > believed the names fit what was actually in memory. UCS2 does not > have surrogate pairs, thus surrogate codes make no sense in UCS2, > they're a UTF-16 concept. Likewise for UCS4. Surrogate codes are not > codepoints, they have no reason to appear in either UCS2 or UCS4 > outside of encoding errors. I welcome correction, but I think you're mistaken. Python 3.3 strings don't have surrogate *pairs*, but they can contain surrogate *code points*. Unicode states: "Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range." 
http://www.unicode.org/charts/PDF/UDC00.pdf http://www.unicode.org/charts/PDF/UD800.pdf So technically surrogates are "non-characters". That doesn't mean they are forbidden though; you can certainly create them, and encode them to UTF-16 and -32: py> surr = '\udc80' py> import unicodedata as ud py> ud.category(surr) 'Cs' py> surr.encode('utf-16') b'\xff\xfe\x80\xdc' py> surr.encode('utf-32') b'\xff\xfe\x00\x00\x80\xdc\x00\x00' However, you cannot encode single surrogates to UTF-8: py> surr.encode('utf-8') Traceback (most recent call last): File "", line 1, in UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed as per the standard: http://www.unicode.org/faq/utf_bom.html#utf8-5 I *think* you are supposed to be able to encode surrogate *pairs* to UTF-8, if I'm reading the FAQ correctly, but it seems Python 3.3 doesn't support that. In any case, it is certainly legal to have Unicode strings containing non-characters, including surrogates, and you can encode them to UTF-16 and -32. However, it looks like surrogates won't round trip in UTF-16, but they will in UTF-32: py> surr.encode('utf-16').decode('utf-16') == surr Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'utf16' codec can't decode bytes in position 2-3: unexpected end of data py> surr.encode('utf-32').decode('utf-32') == surr True So... I'm not sure why this will be useful. Presumably Unicode strings containing surrogate code points will be rare, and you can't encode them to UTF-8 at all, and you can't round trip them from UTF-16. -- Steven From stephen at xemacs.org Tue Oct 8 15:31:07 2013 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 08 Oct 2013 22:31:07 +0900 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> Message-ID: <87vc17khyc.fsf@uwakimon.sk.tsukuba.ac.jp> Masklinn writes: > I don't know the details of the flexible string representation, but I > believed the names fit what was actually in memory. UCS2 does not > have surrogate pairs, thus surrogate codes make no sense in UCS2, > they're a UTF-16 concept. Likewise for UCS4. Surrogate codes are not > codepoints, they have no reason to appear in either UCS2 or UCS4 > outside of encoding errors. True, but Python doesn't actually use UCS2 or UCS4 internally. It uses UCS2 or UCS4 plus a row of codes from the surrogate area to represent undecodable bytes. This feature is optional (enabled by using the appropriate error= setting in the codec), but I don't suppose it's going to go away. 
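A short interpreter session illustrating the PEP 383 behaviour described above: with the "surrogateescape" error handler, undecodable bytes are represented in the str as lone surrogates in the U+DC80..U+DCFF range and round-trip back to the original bytes, while a strict encode still rejects them. This is standard Python 3 behaviour (the exact traceback text may vary slightly between versions):

    >>> raw = b'abc\xff'                # 0xFF can never appear in valid UTF-8
    >>> s = raw.decode('utf-8', 'surrogateescape')
    >>> s
    'abc\udcff'
    >>> s.encode('utf-8', 'surrogateescape') == raw
    True
    >>> s.encode('utf-8')               # strict encoding still refuses it
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in
    position 3: surrogates not allowed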
From masklinn at masklinn.net Tue Oct 8 15:48:18 2013 From: masklinn at masklinn.net (Masklinn) Date: Tue, 8 Oct 2013 15:48:18 +0200 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: <20131008130208.GX7989@ando> References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> Message-ID: On 2013-10-08, at 15:02 , Steven D'Aprano wrote: [snipped early part as any response would be superseded by or redundant with the stuff below] > However, you cannot encode single surrogates to UTF-8: > > py> surr.encode('utf-8') > Traceback (most recent call last): > File "", line 1, in > UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in > position 0: surrogates not allowed > > as per the standard: > > http://www.unicode.org/faq/utf_bom.html#utf8-5 > > I *think* you are supposed to be able to encode surrogate *pairs* to > UTF-8, if I'm reading the FAQ correctly I'm reading the opposite, from http://www.unicode.org/faq/utf_bom.html#utf8-4: > there is a widespread practice of generating pairs of three byte > sequences in older software, especially software which pre-dates the > introduction of UTF-16 or that is interoperating with UTF-16 > environments under particular constraints. Such an encoding is not > conformant to UTF-8 as defined. Pairs of 3-byte sequences would be encoding each surrogate directly to UTF-8, whereas a single 4-byte sequence would be decoding the surrogate pair to a codepoint and encoding that codepoint to UTF-8. My reading of the FAQ makes the second interpretation the only valid one. So you can't encode surrogates (either lone or paired) to UTF-8, you can encode the codepoint encoded by a surrogate pair. > In any case, it is certainly legal to have Unicode strings > containing non-characters, including surrogates, and you can encode them > to UTF-16 and ?32. The UTF-32 section has similar note to UTF-8: http://www.unicode.org/faq/utf_bom.html#utf32-7 > A: If an unpaired surrogate is encountered when converting ill-formed > UTF-16 data, any conformant converter must treat this as an error. By > representing such an unpaired surrogate on its own, the resulting UTF-32 > data stream would become ill-formed. While it faithfully reflects the > nature of the input, Unicode conformance requires that encoding form > conversion always results in valid data stream. and the UTF-16 section points out: http://www.unicode.org/faq/utf_bom.html#utf16-7 > Q: Are there any 16-bit values that are invalid? > A: Unpaired surrogates are invalid in UTFs. These include any value in > the range D80016 to DBFF16 not followed by a value in the range DC0016 > to DFFF16, or any value in the range DC0016 to DFFF16 not preceded by a > value in the range D80016 to DBFF16. As far as I can read the FAQ, it is always invalid to encode a surrogate, surrogates are not to be considered codepoints (they're not just noncharacters[0], noncharacters are codepoints), and a lone surrogate in a UTF-16 stream means the stream is corrupted, which should result in an error during transcoding to anything (unless some recovery mode is used to replace corrupted characters by some mark during decoding I guess). > So... I'm not sure why this will be useful. Presumably Unicode strings > containing surrogate code points will be rare And they're a sign of corrupted stream. The FAQ reads a bit strangely, I think because it's written from the viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and UTF-32 are transcoding from that. 
Which does not apply to CPython and the FSR. Parsing the FAQ with that viewpoint, I believe a CPython string (unicode) must not contain surrogate codes: a surrogate pair should have been decoded from UTF-16 to a codepoint (then identity-encoded to UCS4) and a single surrogate should have been caught by the UTF-16 decoder and should have triggered the error handler at that point. A surrogate code in a CPython string means the string is corrupted[1]. Surrogates *may* appear in binary data, while building a UTF-16 bytestream by hand. [0] since "noncharacter" has a well-defined meaning in unicode, and only applies to 66 codepoints, a much smaller range than surrogates: http://www.unicode.org/faq/private_use.html#noncharacters [1] note that this hinges on my understanding of "UCS2" in FSR being actual UCS2, if it's UCS2-with-surrogates with a heuristic for switching between UCS2 and UCS4 depending on the number of surrogate pairs in the string it does not apply From steve at pearwood.info Tue Oct 8 16:20:09 2013 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 9 Oct 2013 01:20:09 +1100 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> Message-ID: <20131008142009.GY7989@ando> On Tue, Oct 08, 2013 at 03:48:18PM +0200, Masklinn wrote: > On 2013-10-08, at 15:02 , Steven D'Aprano wrote: > > py> surr.encode('utf-8') > > Traceback (most recent call last): > > File "", line 1, in > > UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in > > position 0: surrogates not allowed > > > > as per the standard: > > > > http://www.unicode.org/faq/utf_bom.html#utf8-5 > > > > I *think* you are supposed to be able to encode surrogate *pairs* to > > UTF-8, if I'm reading the FAQ correctly > > I'm reading the opposite, from http://www.unicode.org/faq/utf_bom.html#utf8-4: > > > there is a widespread practice of generating pairs of three byte > > sequences in older software, especially software which pre-dates the > > introduction of UTF-16 or that is interoperating with UTF-16 > > environments under particular constraints. Such an encoding is not > > conformant to UTF-8 as defined. > > Pairs of 3-byte sequences would be encoding each surrogate directly to > UTF-8, whereas a single 4-byte sequence would be decoding the surrogate > pair to a codepoint and encoding that codepoint to UTF-8. My reading > of the FAQ makes the second interpretation the only valid one. It's not that clear to me. I fear the Unicode FAQs don't distinguish between Unicode strings and bytes well enough for my liking :( But for the record, my interpretion is that if you have a pair of code points constisting of the same values as a valid surrogate pair, you should be able to encode to UTF-8. To give a concrete example: Given: c = '\N{LINEAR B SYLLABLE B038 E}' # \U00010001 c.encode('utf-8') => b'\xf0\x90\x80\x81' and: c.encode('utf-16BE') # encodes as a surrogate pair => b'\xd8\x00\xdc\x01' then those same surrogates, taken as codepoints, should be encodable as UTF-8: '\ud800\udc01'.encode('utf-8') => b'\xf0\x90\x80\x81' I'd actually be disappointed if that were the case; I think that would be a poor design. But if that's what the Unicode standard demands, Python ought to support it. But hopefully somebody will explain to me why my interpretation is wrong :-) [...] 
> The FAQ reads a bit strangely, I think because it's written from the > viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and > UTF-32 are transcoding from that. Which does not apply to CPython and > the FSR. Hmmm... well, that might explain it. If it's written by Java programmers for Java programmers, they may very well decide that having spent 20 years trying to convince people that string != ASCII, they're now going to convince them that string == UTF-16 instead :/ > Parsing the FAQ with that viewpoint, I believe a CPython string (unicode) > must not contain surrogate codes: a surrogate pair should have been > decoded from UTF-16 to a codepoint (then identity-encoded to UCS4) and a > single surrogate should have been caught by the UTF-16 decoder and > should have triggered the error handler at that point. A surrogate code > in a CPython string means the string is corrupted[1]. I think that interpretation is a bit strong. I think it would be fair to say that CPython strings may contain surrogates, but you can't encode them to bytes using the UTFs. Nor are there any byte sequences that can be decoded to surrogates using the UTFs. This essentially means that you can only get surrogates in a string using (e.g.) chr() or \u escapes, and you can't then encode them to bytes using UTF encodings. > Surrogates *may* appear in binary data, while building a UTF-16 > bytestream by hand. But there you're talking about bytes, not byte strings. Byte strings can contain any bytes you like :-) -- Steven From turnbull at sk.tsukuba.ac.jp Tue Oct 8 16:31:25 2013 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Tue, 08 Oct 2013 23:31:25 +0900 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> Message-ID: <87txgrkf5u.fsf@uwakimon.sk.tsukuba.ac.jp> Masklinn writes: > The FAQ reads a bit strangely, I think because it's written from the > viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and > UTF-32 are transcoding from that. Which does not apply to CPython and > the FSR. No, it's written from the viewpoint that it says *nothing* about internal encodings, only about the encodings used in interchange of textual data, and about certain aspects of the processes that may receive and generate such data (eg, when data matches a Unicode regular expression, or how bidirectional text should appear visually). > Parsing the FAQ with that viewpoint, I believe a CPython string (unicode) > must not contain surrogate codes: No, it says no such thing. All the Unicode Standard (and the FAQ) says is that if Python generates output that purports to be text encoded in Unicode, it may not contain surrogate codes except where those codes are used according to UTF-16 to encode characters in planes 2 to 17, and if it receives data alleged to be Unicode in some transformation format, it must raise an error if it receives surrogates other than a correctly formed surrogate pair in text known to be encoded as UTF-16. In fact (as I wrote before without proper citation), the internal encoding of Python has been extended by PEP 383 to use a subset of the surrogate space to represent undecodable bytes in an octet stream, when the error handler is set to "surrogateescape". Furthermore, there is nothing to stop a Python unicode from containing any code unit (including both surrogates and other non-characters like 0xFFFF). 
Checking of the rules you cite is done by codecs, at encoding and decoding time. From masklinn at masklinn.net Tue Oct 8 16:40:58 2013 From: masklinn at masklinn.net (Masklinn) Date: Tue, 8 Oct 2013 16:40:58 +0200 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: <20131008142009.GY7989@ando> References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> <20131008142009.GY7989@ando> Message-ID: On 2013-10-08, at 16:20 , Steven D'Aprano wrote > I'd actually be disappointed if that were the case; I think that would > be a poor design. But if that's what the Unicode standard demands, > Python ought to support it. That would be really weird, it'd mean an *encoder* has to translate a surrogate pair into the actual codepoint in some sort of weird UTF-specific normalization pass. > But hopefully somebody will explain to me why my interpretation is wrong > :-) > > [...] >> The FAQ reads a bit strangely, I think because it's written from the >> viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and >> UTF-32 are transcoding from that. Which does not apply to CPython and >> the FSR. > > Hmmm... well, that might explain it. If it's written by Java programmers > for Java programmers, they may very well decide that having spent 20 > years trying to convince people that string != ASCII, they're now > going to convince them that string == UTF-16 instead :/ To be fair, it's not just java programmers, IIRC ICU uses UTF-16 as the internal encoding. >> Parsing the FAQ with that viewpoint, I believe a CPython string (unicode) >> must not contain surrogate codes: a surrogate pair should have been >> decoded from UTF-16 to a codepoint (then identity-encoded to UCS4) and a >> single surrogate should have been caught by the UTF-16 decoder and >> should have triggered the error handler at that point. A surrogate code >> in a CPython string means the string is corrupted[1]. > > I think that interpretation is a bit strong. I think it would be fair to > say that CPython strings may contain surrogates, but you can't encode > them to bytes using the UTFs. Nor are there any byte sequences that can > be decoded to surrogates using the UTFs. > > This essentially means that you can only get surrogates in a string > using (e.g.) chr() or \u escapes, and you can't then encode them to > bytes using UTF encodings. > >> Surrogates *may* appear in binary data, while building a UTF-16 >> bytestream by hand. > > But there you're talking about bytes, not byte strings. Byte strings can > contain any bytes you like :-) Yes, that's basically what I mean: I think surrogates only make sense in a bytestream, not in a unicode stream. Although I did not remember/was not aware of PEP 383 (thank you Stephen) which makes the Unicode spec irrelevant to what Python string contains. On 2013-10-08, at 16:31 , Stephen J. Turnbull wrote: > Furthermore, there is nothing to stop a Python unicode from containing > any code unit (including both surrogates and other non-characters like > 0xFFFF). Checking of the rules you cite is done by codecs, at > encoding and decoding time. 
noncharacters are a very different case for what it's worth, their own FAQ clearly notes that they are valid full-fledged codepoints and must be encoded and preserved by UTFs: http://www.unicode.org/faq/private_use.html#nonchar7 From random832 at fastmail.us Tue Oct 8 17:27:52 2013 From: random832 at fastmail.us (random832 at fastmail.us) Date: Tue, 08 Oct 2013 11:27:52 -0400 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> Message-ID: <1381246072.12709.31490813.0B5674DF@webmail.messagingengine.com> On Tue, Oct 8, 2013, at 7:58, Masklinn wrote: > I don't know the details of the flexible string representation, but I > believed the names fit what was actually in memory. UCS2 does not > have surrogate pairs, thus surrogate codes make no sense in UCS2, > they're a UTF-16 concept. Likewise for UCS4. Surrogate codes are not > codepoints, they have no reason to appear in either UCS2 or UCS4 > outside of encoding errors. They can also occur due to slicing a ctypes unicode buffer, due to PEP 383, or due to native UTF-16 filenames that contain invalid surrogates. The latter two also create situations where you need to generate them. From storchaka at gmail.com Tue Oct 8 17:55:25 2013 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 08 Oct 2013 18:55:25 +0300 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: <20131008130208.GX7989@ando> References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> Message-ID: 08.10.13 16:02, Steven D'Aprano ???????(??): > So... I'm not sure why this will be useful. This is a bug. http://bugs.python.org/issue12892 From storchaka at gmail.com Tue Oct 8 18:16:57 2013 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 08 Oct 2013 19:16:57 +0300 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: <5253F348.3010204@egenix.com> References: <5253F348.3010204@egenix.com> Message-ID: 08.10.13 14:58, M.-A. Lemburg ???????(??): > I guess you could use one bit from the kind structure > for that: The kind of string should be equal to the size of character unit. This assumption is used in a lot of code. From storchaka at gmail.com Tue Oct 8 18:21:57 2013 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 08 Oct 2013 19:21:57 +0300 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: Message-ID: 08.10.13 15:23, Victor Stinner ???????(??): > I like the idea. I prefer to add another flag (1 bit), instead of > having a complex with 4 different values. We need at least 3-states value: yes, no, may be. But combining with is_ascii flag we need only one additional bit. I think that it shouldn't be more complex. > Your idea looks specific to the PEP 393, so I prefer to keep the flag > private. Otherwise it would be hard for other implementations of > Python to implement the function getting the flag value. Yes, of course. From mal at egenix.com Tue Oct 8 18:28:51 2013 From: mal at egenix.com (M.-A. Lemburg) Date: Tue, 08 Oct 2013 18:28:51 +0200 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: <5253F348.3010204@egenix.com> Message-ID: <525432C3.3070905@egenix.com> On 08.10.2013 18:16, Serhiy Storchaka wrote: > 08.10.13 14:58, M.-A. Lemburg ???????(??): >> I guess you could use one bit from the kind structure >> for that: > > The kind of string should be equal to the size of character unit. 
This assumption is used in a lot > of code. Ok, then just add the flag to the end of the list... we'd still have at least 7 bits left on most platforms, IICC. PS: I guess this use of kind should be documented clearly somewhere. The unicodeobject.h file only hints at this and for PyUnicode_WCHAR_KIND this interpretation cannot be used. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 08 2013) >>> Python Projects, Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2013-10-14: PyCon DE 2013, Cologne, Germany ... 6 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From bruce at leapyear.org Tue Oct 8 22:37:54 2013 From: bruce at leapyear.org (Bruce Leban) Date: Tue, 8 Oct 2013 13:37:54 -0700 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: <20131008142009.GY7989@ando> References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> <20131008142009.GY7989@ando> Message-ID: On Tue, Oct 8, 2013 at 7:20 AM, Steven D'Aprano wrote: > Given: > > c = '\N{LINEAR B SYLLABLE B038 E}' # \U00010001 > c.encode('utf-8') > => b'\xf0\x90\x80\x81' > > and: > > c.encode('utf-16BE') # encodes as a surrogate pair > => b'\xd8\x00\xdc\x01' > > then those same surrogates, taken as codepoints, should be encodable as > UTF-8: > > '\ud800\udc01'.encode('utf-8') > => b'\xf0\x90\x80\x81' > > > I'd actually be disappointed if that were the case; I think that would > be a poor design. But if that's what the Unicode standard demands, > Python ought to support it. > The FAQ is explicit that this is wrong: "The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four byte sequence." http://www.unicode.org/faq/utf_bom.html#utf8-4 It goes on to say that there is a widespread practice of doing it anyway in older software. Therefore, it might be acceptable to accept these mis-encoded characters when *decoding* but they should never be generated when *encoding*. I'd prefer not to have that on by default given the history of overlong UTF-8 bugs (e.g., see http://blogs.msdn.com/b/michael_howard/archive/2008/08/22/overlong-utf-8-escapes-bite.aspx). Essentially if different decoders follow different rules, then you can sometimes sneak stuff through the permissive decoders. Notwithstanding that, there is a different unicode encoding CESU-8 which does the opposite: it always encodes those characters requiring surrogate pairs as 6 bytes consisting of two UTF-8-style encodings of the individual surrogate codepoints. Python doesn't support this and the request to support it was rejected: http://bugs.python.org/issue12742 --- Bruce I'm hiring: http://www.cadencemd.com/info/jobs Latest blog post: Alice's Puzzle Page http://www.vroospeak.com Learn how hackers think: http://j.mp/gruyere-security -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From greg.ewing at canterbury.ac.nz Wed Oct 9 00:49:29 2013 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 09 Oct 2013 11:49:29 +1300 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> <20131008142009.GY7989@ando> Message-ID: <52548BF9.4070802@canterbury.ac.nz> Bruce Leban wrote: > The FAQ is explicit that this is wrong: "The definition of UTF-8 > requires that supplementary characters (those using surrogate pairs in > UTF-16) be encoded with a single four byte > sequence." http://www.unicode.org/faq/utf_bom.html#utf8-4 Python's internal string representation is not UTF-16, though, so this doesn't apply directly. Seems to me it hinges on whether a pair of surrogate code points appearing in a Python string are meant to represent a single character or not. I would say not, because otherwise they would have been stored as a single code unit. -- Greg From steve at pearwood.info Wed Oct 9 02:55:07 2013 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 9 Oct 2013 11:55:07 +1100 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> <20131008142009.GY7989@ando> Message-ID: <20131009005507.GB7989@ando> On Tue, Oct 08, 2013 at 01:37:54PM -0700, Bruce Leban wrote: > On Tue, Oct 8, 2013 at 7:20 AM, Steven D'Aprano wrote: > > > Given: > > > > c = '\N{LINEAR B SYLLABLE B038 E}' # \U00010001 > > c.encode('utf-8') > > => b'\xf0\x90\x80\x81' > > > > and: > > > > c.encode('utf-16BE') # encodes as a surrogate pair > > => b'\xd8\x00\xdc\x01' > > > > then those same surrogates, taken as codepoints, should be encodable as > > UTF-8: > > > > '\ud800\udc01'.encode('utf-8') > > => b'\xf0\x90\x80\x81' > > > > > > I'd actually be disappointed if that were the case; I think that would > > be a poor design. But if that's what the Unicode standard demands, > > Python ought to support it. > > > > The FAQ is explicit that this is wrong: "The definition of UTF-8 requires > that supplementary characters (those using surrogate pairs in UTF-16) be > encoded with a single four byte sequence." > http://www.unicode.org/faq/utf_bom.html#utf8-4 And if you count the number of bytes, you will see four of them: '\ud800\udc01'.encode('utf-8') => b'\xf0' b'\x90' b'\x80' b'\x81' I stress that Python 3.3 doesn't actually do this, but my reading of the FAQ suggests that it should. The question isn't what UTF-8 should do with supplmentary characters (those outside the BMP). That is well-defined, and Python 3.3 gets it right. The question is what it should do with pairs of surrogates. Ill-formed surrogates are rightly illegal when encoding to UTF-8: # a lone surrogate is illegal '\ud800'.encode('utf-8') must be treated as an error # two high surrogates, or two low surrogates '\udc01\udc01'.encode('utf-8') must be treated as an error '\ud800\ud800'.encode('utf-8') must be treated as an error # if they're in the wrong order '\udc01\ud800'.encode('utf-8') must be treated as an error The only thing that I'm not sure is how to deal with *valid* pairs of surrogates: '\ud800\udc01'.encode('utf-8') should do what? I personally would hope that this too should raise, which is Python's current behaviour, but my reading of the FAQs is that it should be treated as if there were an implicit UTF-16 conversion. (I hope I'm wrong!) 
That is: 1) treat the sequence of code points as if it were a sequence of two 16-bit values b'\xd8\x00' b'\xdc\x01' 2) implicitly decode it using UTF-16 to get U+10001 3) encode U+10001 using UTF-8 to get b'\xf0\x90\x80\x81' That would be (in my opinion) *horrible*, but that's my reading of the Unicode FAQ. The question asks: "How do I convert a UTF-16 surrogate pair such as to UTF-8?" and the answer seems to be: "The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four byte sequence." which doesn't actually answer the question (the question is about SURROGATE PAIRS, the answer is about SUPPLEMENTARY CHARACTERS) but suggests the above horrible interpretation. What I'm hoping for is a definite source that explains what the UTF-8 encoder is supposed to do with a Unicode string containing surrogates. (And presumably the other UTF encoders as well, although I haven't tried thinking about them yet.) > It goes on to say that there is a widespread practice of doing it anyway in > older software. Therefore, it might be acceptable to accept these > mis-encoded characters when *decoding* but they should never be generated > when *encoding*. They are talking about the practice of generating six bytes, two three-byte sequences. You should notice that I'm not generating six bytes anywhere. -- Steven From stephen at xemacs.org Wed Oct 9 04:03:46 2013 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 09 Oct 2013 11:03:46 +0900 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: <52548BF9.4070802@canterbury.ac.nz> References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> <20131008142009.GY7989@ando> <52548BF9.4070802@canterbury.ac.nz> Message-ID: <87eh7vjj3x.fsf@uwakimon.sk.tsukuba.ac.jp> Greg Ewing writes: > Bruce Leban wrote: > > The FAQ is explicit that this is wrong: "The definition of UTF-8 > > requires that supplementary characters (those using surrogate pairs in > > UTF-16) be encoded with a single four byte > > sequence." http://www.unicode.org/faq/utf_bom.html#utf8-4 > > Python's internal string representation is not UTF-16, though, > so this doesn't apply directly. It applies directly to Steven's examples, since they use .encode() and .decode(). > Seems to me it hinges on whether a pair of surrogate code > points appearing in a Python string are meant to represent > a single character or not. Only (a subset of) low surrogates is valid in a Python string, so a pair can't possibly respresent a supplementary character in UTF-16 encoding. From tjreedy at udel.edu Wed Oct 9 04:43:54 2013 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 08 Oct 2013 22:43:54 -0400 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: <20131009005507.GB7989@ando> References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> <20131008142009.GY7989@ando> <20131009005507.GB7989@ando> Message-ID: On 10/8/2013 8:55 PM, Steven D'Aprano wrote: > '\ud800\udc01'.encode('utf-8') > => b'\xf0' b'\x90' b'\x80' b'\x81' > > I stress that Python 3.3 doesn't actually do this, but my reading of the > FAQ suggests that it should. And I already explained on python-list why that reading is wrong; transcoding a utf-16 string (sequence of 2-byte words, subject to validity rules) is different from encoding unicode text (character sequence, and surrogates are not characters). 
A utf-16 to utf-8 transcoder should (must) do the above, but in 3.3+, the utf-8 codec is no longer the utf-16 trancoder that it effectively was for narrow builds. Each utf form defines a one to one mapping between unicode texts and valid code unit sequences. (Unicode Standard, Chapter 3, definition D79.) Having both '\U00010001' and '\ud800\udc01' map to b'\xf0\x90\x80\x81' would violate that important property. '\ud800\udc01' represents a character in utf-16 but not in python's flexible string representation. The latter uses one code unit (of variable size per string) per character, instead of a variable number of code units (of one size for all strings) per character. Because machines have not conceptual, visual, or aural memory, but only byte memory, they must re-encode abstract characters to bytes to remember them. In pre 3.3 narrow builds, where utf-16 was used internally, decoding and encoding amounted to transcoding bytes encodings into the utf-16 encoding, and vice versa. So utf-8 b'\xf0\x90\x80\x81' and utf-16 '\ud800\udc01' were mapped into each other. Whether the mapping was done directly or indirectly, via the character codepoint value, did not matter to the user. In any case FSR no longer uses multiple-code-unit encodings internally, and '\ud800\udc01', even though allowed for practical reasons, does not represent and is not the same as '\U00010001'. The proposed 'has_surrogates' flag amounts to an 'not strictly valid' flag. Only the FSR implementors can decide if it is worth the trouble. -- Terry Jan Reedy From stephen at xemacs.org Wed Oct 9 06:29:04 2013 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 09 Oct 2013 13:29:04 +0900 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: <20131009005507.GB7989@ando> References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> <20131008142009.GY7989@ando> <20131009005507.GB7989@ando> Message-ID: <87a9ijjcdr.fsf@uwakimon.sk.tsukuba.ac.jp> Steven D'Aprano writes: > What I'm hoping for is a definite source that explains what the UTF-8 > encoder is supposed to do with a Unicode string containing > surrogates. According to PEP 383, which provides a special mechanism for roundtripping input that claims to be a particular encoding but does not conform to that encoding, when encoding to UTF-8, if the errors= parameter *is* surrogateescape *and* the value is in the first row of the low surrogate range, it is masked by 0xff and emitted as a single byte. In all other cases of surrogates, it should raise an error. A conforming Unicode codec must not emit UTF-8 which would decode to a surrogate. These cases can occur in valid Python programs because chr() is unconstrained (for example). On input, Unicode conformance means that when using the surrogateescape handler, an alleged UTF-8 stream containing a 6-byte sequence that would algorithmically decode to a surrogate pair should be represented internally as a sequence of 6 surrogates from the first row of the low surrogate range. If the surrogateescape handler is not in use, it should raise an error. Sorry about not testing actual behavior, gotta run to a meeting. I forget what PEP 383 says about other Unicode codecs. 
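For the record, the UTF-8 half of this is easy to check interactively; a minimal sketch of the PEP 383 round trip described above (the byte value is an arbitrary example, not anything from this thread):

>>> raw = b'abc\xff'                      # not valid UTF-8
>>> s = raw.decode('utf-8', 'surrogateescape')
>>> s
'abc\udcff'
>>> s.encode('utf-8', 'surrogateescape')  # round-trips back to the original bytes
b'abc\xff'
>>> s.encode('utf-8')                     # the default 'strict' handler raises UnicodeEncodeError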
From bruce at leapyear.org Wed Oct 9 04:09:11 2013 From: bruce at leapyear.org (Bruce Leban) Date: Tue, 8 Oct 2013 19:09:11 -0700 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: <20131009005507.GB7989@ando> References: <70EDEE78-A85F-4558-A940-32E72DAC8F2C@masklinn.net> <20131008130208.GX7989@ando> <20131008142009.GY7989@ando> <20131009005507.GB7989@ando> Message-ID: Sorry. I don't think what I said contributed to the conversation very well. Let me try again. On Tue, Oct 8, 2013 at 5:55 PM, Steven D'Aprano wrote: > On Tue, Oct 08, 2013 at 01:37:54PM -0700, Bruce Leban wrote: > > The question isn't what UTF-8 should do with supplmentary characters > (those outside the BMP). That is well-defined, and Python 3.3 gets it > right. The question is what it should do with pairs of surrogates. > Ill-formed surrogates are rightly illegal when encoding to UTF-8: > > The only thing that I'm not sure is how to deal with *valid* > pairs of surrogates: > > '\ud800\udc01'.encode('utf-8') should do what? > > I don't think that's valid. While it is a sequence of Unicode *codepoints *(Python definition of unicode string) it is not a sequence of Unicode * characters*. Arguably, Python should insist that a Unicode string be a sequence of Unicode characters and reject '\ud800\udc01' at compile time just as it does '\U01010101' as those are all not valid Unicode characters. However, I concede that is unlikely to happen. Here's how I read the FAQ. Most of this FAQ is written in terms of converting one representation to another. Python strings are not one of those representations. A *Unicode transformation format* (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. http://www.unicode.org/faq/utf_bom.html#gen2 To convert UTF-X to UTF-Y, you convert the UTF-X to a sequence of characters and then convert that to UTF-Y. Note that this excludes surrogate code points -- they are not representable in the sequence of code points that a UTF defines. The definition of UTF-32 says: Any Unicode character can be represented as a single 32-bit unit in UTF-32. This single 4 code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. http://www.unicode.org/faq/utf_bom.html#utf32-1 Thus a surrogate codepoint is NOT allowed in UTF-32 as it is not a character and if it is encountered it should be treated as an error. --- Bruce I'm hiring: http://www.cadencemd.com/info/jobs Latest blog post: Alice's Puzzle Page http://www.vroospeak.com Learn how hackers think: http://j.mp/gruyere-security -------------- next part -------------- An HTML attachment was scrubbed... URL: From victor.stinner at gmail.com Fri Oct 11 14:12:37 2013 From: victor.stinner at gmail.com (Victor Stinner) Date: Fri, 11 Oct 2013 14:12:37 +0200 Subject: [Python-ideas] Add "has_surrogates" flags to string object In-Reply-To: References: Message-ID: 2013/10/8 Serhiy Storchaka : > Here is an idea about adding a mark to PyUnicode object which allows fast > answer to the question if a string has surrogate code. This mark has one of > three possible states: > > * String doesn't contain surrogates. > * String contains surrogates. > * It is still unknown. > > We can combine this with "is_ascii" flag in 2-bit value: > > * String is ASCII-only (and doesn't contain surrogates). > * String is not ASCII-only and doesn't contain surrogates. > * String is not ASCII-only and contains surrogates. 
> * String is not ASCII-only and it is still unknown if it contains surrogate. > > By default a string is created in "unknown" state (if it is UCS2 or UCS4). > After first request it can be switched to "has surrogates" or "hasn't > surrogates". State of the result of concatenating or slicing can be > determined from states of input strings. > > This will allow faster UTF-16 and UTF-32 encoding (and perhaps even a little > faster UTF-8 encoding) and converting to wchar_t* if string hasn't > surrogates (this is true in most cases). Knowing if a string contains any surrogate character would also speedup marshal and pickle modules: http://bugs.python.org/issue19219#msg199465 Victor From mistersheik at gmail.com Fri Oct 11 20:29:43 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Fri, 11 Oct 2013 11:29:43 -0700 (PDT) Subject: [Python-ideas] An exhaust() function for iterators In-Reply-To: References: Message-ID: <5a7a21a5-bd7e-4bc7-a80f-e6d6154f0e13@googlegroups.com> This was also my thought. On Sunday, September 29, 2013 4:42:20 PM UTC-4, Serhiy Storchaka wrote: > > 29.09.13 07:06, Clay Sweetser ???????(??): > > I would like to propose that this function, or one very similar to it, > > be added to the standard library, either in the itertools module, or > > the standard namespace. > > If nothing else, doing so would at least give a single *obvious* way > > to exhaust an iterator, instead of the several miscellaneous methods > > available. > > I prefer optimize the for loop so that it will be most efficient way (it > is already most obvious way). > > _______________________________________________ > Python-ideas mailing list > Python... at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mertz at gnosis.cx Fri Oct 11 20:51:20 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 11:51:20 -0700 Subject: [Python-ideas] An exhaust() function for iterators In-Reply-To: <5a7a21a5-bd7e-4bc7-a80f-e6d6154f0e13@googlegroups.com> References: <5a7a21a5-bd7e-4bc7-a80f-e6d6154f0e13@googlegroups.com> Message-ID: It is hard to imagine that doing this: for _ in side_effect_iter: pass Could EVER realistically spend a significant share of its time in the loop code. Side effects almost surely need to do something that vastly overpowers the cost of the loop itself (maybe some I/O, maybe some computation), or there's no point in using a side-effect iterator. I know you *could* technically write: def side_effect_iter(N, obj): for n in range(N): obj.val = n yield True And probably something else whose only side effect was changing some value that doesn't need real computation. But surely writing that and exhausting that iterator is NEVER the best way to code such a thing. On the other hand, a more realistic one like this: def side_effect_iter(N): for n in range(N): val = complex_computation(n) write_to_slow_disk(val) yield True Is going to take a long time in each iteration, and there's no reason to care that the loop isn't absolutely optimal speed. On Fri, Oct 11, 2013 at 11:29 AM, Neil Girdhar wrote: > This was also my thought. > > > On Sunday, September 29, 2013 4:42:20 PM UTC-4, Serhiy Storchaka wrote: > >> 29.09.13 07:06, Clay Sweetser ???????(??): >> > I would like to propose that this function, or one very similar to it, >> > be added to the standard library, either in the itertools module, or >> > the standard namespace. 
>> > If nothing else, doing so would at least give a single *obvious* way >> > to exhaust an iterator, instead of the several miscellaneous methods >> > available. >> >> I prefer optimize the for loop so that it will be most efficient way (it >> is already most obvious way). >> >> ______________________________**_________________ >> Python-ideas mailing list >> Python... at python.org >> https://mail.python.org/**mailman/listinfo/python-ideas >> >> > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From storchaka at gmail.com Fri Oct 11 21:02:42 2013 From: storchaka at gmail.com (Serhiy Storchaka) Date: Fri, 11 Oct 2013 22:02:42 +0300 Subject: [Python-ideas] An exhaust() function for iterators In-Reply-To: References: <5a7a21a5-bd7e-4bc7-a80f-e6d6154f0e13@googlegroups.com> Message-ID: 11.10.13 21:51, David Mertz wrote: > It is hard to imagine that doing this: > > for _ in side_effect_iter: pass > > Could EVER realistically spend a significant share of its time in the > loop code. When I wrote a test for tee() (issue #13454) I needed very fast iterator exhaustion. There were one or two other similar cases. From mistersheik at gmail.com Fri Oct 11 20:38:33 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Fri, 11 Oct 2013 11:38:33 -0700 (PDT) Subject: [Python-ideas] Extremely weird itertools.permutations Message-ID: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> "It is universally agreed that a list of n distinct symbols has n! permutations. However, when the symbols are not distinct, the most common convention, in mathematics and elsewhere, seems to be to count only distinct permutations." -- http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original. Should we consider fixing itertools.permutations to output only unique permutations (if possible, although I realize that would break code)? It is completely non-obvious to have permutations returning duplicates. For a non-breaking compromise what about adding a flag? Best, Neil -------------- next part -------------- An HTML attachment was scrubbed... URL: From storchaka at gmail.com Fri Oct 11 21:29:35 2013 From: storchaka at gmail.com (Serhiy Storchaka) Date: Fri, 11 Oct 2013 22:29:35 +0300 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> Message-ID: 11.10.13 21:38, Neil Girdhar wrote: > Should we consider fixing itertools.permutations to output only > unique permutations (if possible, although I realize that would break > code)? It is completely non-obvious to have permutations returning > duplicates. For a non-breaking compromise what about adding a flag? I think this should be a separate function.
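As a strawman, such a separate helper could be as small as the sketch below (illustrative only, not a worked-out API: it deduplicates by equality, so the elements must be hashable, and it has to remember every distinct tuple it has yielded):

from itertools import permutations

def unique_permutations(iterable, r=None):
    # Yield only the permutations not seen before (compared by equality).
    seen = set()
    for perm in permutations(iterable, r):
        if perm not in seen:
            seen.add(perm)
            yield perm

For example, list(unique_permutations('aab')) gives the three distinct arrangements instead of six.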
From mertz at gnosis.cx Fri Oct 11 22:02:11 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 13:02:11 -0700 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> Message-ID: What would you like this hypothetical function to output here: >>> from itertools import permutations >>> from decimal import Decimal as D >>> from fractions import Fraction as F >>> items = (3, 3.0, D(3), F(3,1), "aa", "AA".lower(), "a"+"a") >>> list(permutations(items)) It's neither QUITE equality nor identity you are looking for, I think, in nonredundant_permutation(): >> "aa" == "AA".lower(), "aa" is "AA".lower() (True, False) >>> "aa" == "a"+"a", "aa" is "a"+"a" (True, True) >>> D(3) == 3.0, D(3) is 3.0 (True, False) On Fri, Oct 11, 2013 at 11:38 AM, Neil Girdhar wrote: > "It is universally agreed that a list of n distinct symbols has n! > permutations. However, when the symbols are not distinct, the most common > convention, in mathematics and elsewhere, seems to be to count only > distinct permutations." ? > http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original > . > > > Should we consider fixing itertools.permutations and to output only unique > permutations (if possible, although I realize that would break code). It is > completely non-obvious to have permutations returning duplicates. For a > non-breaking compromise what about adding a flag? > > Best, > Neil > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From abarnert at yahoo.com Fri Oct 11 22:19:22 2013 From: abarnert at yahoo.com (Andrew Barnert) Date: Fri, 11 Oct 2013 13:19:22 -0700 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> Message-ID: <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> I think equality is perfectly reasonable here. The fact that {3.0, 3} only has one member seems like the obvious precedent to follow here. 
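A quick illustration of that precedent, nothing more than equality at work:

>>> {3.0, 3}
{3.0}
>>> from decimal import Decimal
>>> from fractions import Fraction
>>> len({3, 3.0, Decimal(3), Fraction(3, 1)})
1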
Sent from a random iPhone On Oct 11, 2013, at 13:02, David Mertz wrote: > What would you like this hypothetical function to output here: > > >>> from itertools import permutations > >>> from decimal import Decimal as D > >>> from fractions import Fraction as F > >>> items = (3, 3.0, D(3), F(3,1), "aa", "AA".lower(), "a"+"a") > >>> list(permutations(items)) > > It's neither QUITE equality nor identity you are looking for, I think, in nonredundant_permutation(): > > >> "aa" == "AA".lower(), "aa" is "AA".lower() > (True, False) > >>> "aa" == "a"+"a", "aa" is "a"+"a" > (True, True) > >>> D(3) == 3.0, D(3) is 3.0 > (True, False) > > On Fri, Oct 11, 2013 at 11:38 AM, Neil Girdhar wrote: >> "It is universally agreed that a list of n distinct symbols has n! permutations. However, when the symbols are not distinct, the most common convention, in mathematics and elsewhere, seems to be to count only distinct permutations." ? http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original. >> >> >> Should we consider fixing itertools.permutations and to output only unique permutations (if possible, although I realize that would break code). It is completely non-obvious to have permutations returning duplicates. For a non-breaking compromise what about adding a flag? >> >> Best, >> Neil >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas > > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas -------------- next part -------------- An HTML attachment was scrubbed... URL: From jon.brandvein at gmail.com Fri Oct 11 23:19:38 2013 From: jon.brandvein at gmail.com (Jonathan Brandvein) Date: Fri, 11 Oct 2013 17:19:38 -0400 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: I think it's fair to use {3.0, 3} as precedent. But note that transitivity is not required by the __eq__() method. In cases of intransitive equality (A == B == C but not A == C), I imagine the result should be ill-defined in the same way that sorting is when the key function is inconsistent. Jon On Fri, Oct 11, 2013 at 4:19 PM, Andrew Barnert wrote: > I think equality is perfectly reasonable here. The fact that {3.0, 3} only > has one member seems like the obvious precedent to follow here. 
> > Sent from a random iPhone > > On Oct 11, 2013, at 13:02, David Mertz wrote: > > What would you like this hypothetical function to output here: > > >>> from itertools import permutations > >>> from decimal import Decimal as D > >>> from fractions import Fraction as F > >>> items = (3, 3.0, D(3), F(3,1), "aa", "AA".lower(), "a"+"a") > >>> list(permutations(items)) > > It's neither QUITE equality nor identity you are looking for, I think, in > nonredundant_permutation(): > > >> "aa" == "AA".lower(), "aa" is "AA".lower() > (True, False) > >>> "aa" == "a"+"a", "aa" is "a"+"a" > (True, True) > >>> D(3) == 3.0, D(3) is 3.0 > (True, False) > > On Fri, Oct 11, 2013 at 11:38 AM, Neil Girdhar wrote: > >> "It is universally agreed that a list of n distinct symbols has n! >> permutations. However, when the symbols are not distinct, the most common >> convention, in mathematics and elsewhere, seems to be to count only >> distinct permutations." ? >> http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original >> . >> >> >> Should we consider fixing itertools.permutations and to output only >> unique permutations (if possible, although I realize that would break >> code). It is completely non-obvious to have permutations returning >> duplicates. For a non-breaking compromise what about adding a flag? >> >> Best, >> Neil >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> >> > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From python at mrabarnett.plus.com Fri Oct 11 23:25:56 2013 From: python at mrabarnett.plus.com (MRAB) Date: Fri, 11 Oct 2013 22:25:56 +0100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> Message-ID: <52586CE4.9030002@mrabarnett.plus.com> On 11/10/2013 20:29, Serhiy Storchaka wrote: > 11.10.13 21:38, Neil Girdhar ???????(??): >> Should we consider fixing itertools.permutations and to output only >> unique permutations (if possible, although I realize that would break >> code). It is completely non-obvious to have permutations returning >> duplicates. For a non-breaking compromise what about adding a flag? > > I think this should be separated function. 
> +1 From mertz at gnosis.cx Fri Oct 11 22:27:34 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 13:27:34 -0700 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: Andrew & Neil (or whoever): Is this *really* what you want: >>> from itertools import permutations >>> def nonredundant_permutations(seq): ... return list(set(permutations(seq))) ... >>> pprint(list(permutations([F(3,1), D(3.0), 3.0]))) [(Fraction(3, 1), Decimal('3'), 3.0), (Fraction(3, 1), 3.0, Decimal('3')), (Decimal('3'), Fraction(3, 1), 3.0), (Decimal('3'), 3.0, Fraction(3, 1)), (3.0, Fraction(3, 1), Decimal('3')), (3.0, Decimal('3'), Fraction(3, 1))] >>> pprint(list(nonredundant_permutations([F(3,1), D(3.0), 3.0]))) [(Fraction(3, 1), Decimal('3'), 3.0)] It seems odd to me to want that. On the other hand, I provide a one-line implementation of the desired behavior if anyone wants it. Moreover, I don't think the runtime behavior of my one-liner is particularly costly... maybe not the best possible, but the best big-O possible. On Fri, Oct 11, 2013 at 1:19 PM, Andrew Barnert wrote: > I think equality is perfectly reasonable here. The fact that {3.0, 3} only > has one member seems like the obvious precedent to follow here. > > Sent from a random iPhone > > On Oct 11, 2013, at 13:02, David Mertz wrote: > > What would you like this hypothetical function to output here: > > >>> from itertools import permutations > >>> from decimal import Decimal as D > >>> from fractions import Fraction as F > >>> items = (3, 3.0, D(3), F(3,1), "aa", "AA".lower(), "a"+"a") > >>> list(permutations(items)) > > It's neither QUITE equality nor identity you are looking for, I think, in > nonredundant_permutation(): > > >> "aa" == "AA".lower(), "aa" is "AA".lower() > (True, False) > >>> "aa" == "a"+"a", "aa" is "a"+"a" > (True, True) > >>> D(3) == 3.0, D(3) is 3.0 > (True, False) > > On Fri, Oct 11, 2013 at 11:38 AM, Neil Girdhar wrote: > >> "It is universally agreed that a list of n distinct symbols has n! >> permutations. However, when the symbols are not distinct, the most common >> convention, in mathematics and elsewhere, seems to be to count only >> distinct permutations." ? >> http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original >> . >> >> >> Should we consider fixing itertools.permutations and to output only >> unique permutations (if possible, although I realize that would break >> code). It is completely non-obvious to have permutations returning >> duplicates. For a non-breaking compromise what about adding a flag? >> >> Best, >> Neil >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> >> > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. 
Intellectual property is > to the 21st century what the slave trade was to the 16th. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Fri Oct 11 23:35:41 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Fri, 11 Oct 2013 17:35:41 -0400 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: > Moreover, I don't think the runtime behavior of my one-liner is particularly costly? It is *extremely* costly. There can be n! permutations, so for even, say, 12 elements, you are looking at many gigabytes of memory needlessly used. One big motivator for itertools is not to have to do this. I'm curious how you would solve this problem: https://www.kattis.com/problems/industrialspy efficiently in Python. I did it by using a unique-ifying generator, but ideally this would not be necessary. Ideally, Python would do exactly what C++ does with next_permutation. Best, Neil On Fri, Oct 11, 2013 at 4:27 PM, David Mertz wrote: > Andrew & Neil (or whoever): > > Is this *really* what you want: > > >>> from itertools import permutations > >>> def nonredundant_permutations(seq): > ... return list(set(permutations(seq))) > ... > >>> pprint(list(permutations([F(3,1), D(3.0), 3.0]))) > [(Fraction(3, 1), Decimal('3'), 3.0), > (Fraction(3, 1), 3.0, Decimal('3')), > (Decimal('3'), Fraction(3, 1), 3.0), > (Decimal('3'), 3.0, Fraction(3, 1)), > (3.0, Fraction(3, 1), Decimal('3')), > (3.0, Decimal('3'), Fraction(3, 1))] > > >>> pprint(list(nonredundant_permutations([F(3,1), D(3.0), 3.0]))) > [(Fraction(3, 1), Decimal('3'), 3.0)] > > It seems odd to me to want that. On the other hand, I provide a one-line > implementation of the desired behavior if anyone wants it. Moreover, I > don't think the runtime behavior of my one-liner is particularly costly... > maybe not the best possible, but the best big-O possible. > > > > On Fri, Oct 11, 2013 at 1:19 PM, Andrew Barnert wrote: > >> I think equality is perfectly reasonable here. The fact that {3.0, 3} >> only has one member seems like the obvious precedent to follow here. >> >> Sent from a random iPhone >> >> On Oct 11, 2013, at 13:02, David Mertz wrote: >> >> What would you like this hypothetical function to output here: >> >> >>> from itertools import permutations >> >>> from decimal import Decimal as D >> >>> from fractions import Fraction as F >> >>> items = (3, 3.0, D(3), F(3,1), "aa", "AA".lower(), "a"+"a") >> >>> list(permutations(items)) >> >> It's neither QUITE equality nor identity you are looking for, I think, in >> nonredundant_permutation(): >> >> >> "aa" == "AA".lower(), "aa" is "AA".lower() >> (True, False) >> >>> "aa" == "a"+"a", "aa" is "a"+"a" >> (True, True) >> >>> D(3) == 3.0, D(3) is 3.0 >> (True, False) >> >> On Fri, Oct 11, 2013 at 11:38 AM, Neil Girdhar wrote: >> >>> "It is universally agreed that a list of n distinct symbols has n! >>> permutations. 
However, when the symbols are not distinct, the most common >>> convention, in mathematics and elsewhere, seems to be to count only >>> distinct permutations." ? >>> http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original >>> . >>> >>> >>> Should we consider fixing itertools.permutations and to output only >>> unique permutations (if possible, although I realize that would break >>> code). It is completely non-obvious to have permutations returning >>> duplicates. For a non-breaking compromise what about adding a flag? >>> >>> Best, >>> Neil >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> >>> >> >> >> -- >> Keeping medicines from the bloodstreams of the sick; food >> from the bellies of the hungry; books from the hands of the >> uneducated; technology from the underdeveloped; and putting >> advocates of freedom in prisons. Intellectual property is >> to the 21st century what the slave trade was to the 16th. >> >> >> >> -- >> Keeping medicines from the bloodstreams of the sick; food >> from the bellies of the hungry; books from the hands of the >> uneducated; technology from the underdeveloped; and putting >> advocates of freedom in prisons. Intellectual property is >> to the 21st century what the slave trade was to the 16th. >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> >> > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- > > --- > You received this message because you are subscribed to a topic in the > Google Groups "python-ideas" group. > To unsubscribe from this topic, visit > https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. > To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe at googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From python at mrabarnett.plus.com Fri Oct 11 23:38:41 2013 From: python at mrabarnett.plus.com (MRAB) Date: Fri, 11 Oct 2013 22:38:41 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: <52586FE1.8040803@mrabarnett.plus.com> On 11/10/2013 21:27, David Mertz wrote: > Andrew & Neil (or whoever): > > Is this *really* what you want: > > >>> from itertools import permutations > >>> def nonredundant_permutations(seq): > ... return list(set(permutations(seq))) > ... 
> >>> pprint(list(permutations([F(3,1), D(3.0), 3.0]))) > [(Fraction(3, 1), Decimal('3'), 3.0), > (Fraction(3, 1), 3.0, Decimal('3')), > (Decimal('3'), Fraction(3, 1), 3.0), > (Decimal('3'), 3.0, Fraction(3, 1)), > (3.0, Fraction(3, 1), Decimal('3')), > (3.0, Decimal('3'), Fraction(3, 1))] > > >>> pprint(list(nonredundant_permutations([F(3,1), D(3.0), 3.0]))) > [(Fraction(3, 1), Decimal('3'), 3.0)] > > It seems odd to me to want that. On the other hand, I provide a > one-line implementation of the desired behavior if anyone wants it. > Moreover, I don't think the runtime behavior of my one-liner is > particularly costly... maybe not the best possible, but the best big-O > possible. > n! gets very big very fast, so that can be a very big set. If you sort the original items first then it's much easier to yield unique permutations without having to remember them. (Each would be > than the previous one, although you might have to map them to orderable keys if they're not orderable themselves, e.g. a mixture of integers and strings.) > > > On Fri, Oct 11, 2013 at 1:19 PM, Andrew Barnert > wrote: > > I think equality is perfectly reasonable here. The fact that {3.0, > 3} only has one member seems like the obvious precedent to follow here. > > Sent from a random iPhone > > On Oct 11, 2013, at 13:02, David Mertz > wrote: > >> What would you like this hypothetical function to output here: >> >> >>> from itertools import permutations >> >>> from decimal import Decimal as D >> >>> from fractions import Fraction as F >> >>> items = (3, 3.0, D(3), F(3,1), "aa", "AA".lower(), "a"+"a") >> >>> list(permutations(items)) >> >> It's neither QUITE equality nor identity you are looking for, I >> think, in nonredundant_permutation(): >> >> >> "aa" == "AA".lower(), "aa" is "AA".lower() >> (True, False) >> >>> "aa" == "a"+"a", "aa" is "a"+"a" >> (True, True) >> >>> D(3) == 3.0, D(3) is 3.0 >> (True, False) >> >> On Fri, Oct 11, 2013 at 11:38 AM, Neil Girdhar >> > wrote: >> >> "It is universally agreed that a list of n distinct symbols >> has n! permutations. However, when the symbols are not >> distinct, the most common convention, in mathematics and >> elsewhere, seems to be to count only distinct permutations." ? >> http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original. >> >> >> Should we consider fixing itertools.permutations and to output >> only unique permutations (if possible, although I realize that >> would break code). It is completely non-obvious to have >> permutations returning duplicates. For a non-breaking >> compromise what about adding a flag? 
>> From mistersheik at gmail.com Fri Oct 11 23:38:27 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Fri, 11 Oct 2013 17:38:27 -0400 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: My code, which was the motivation for this suggestion: import itertools as it import math def is_prime(n): for i in range(2, int(math.floor(math.sqrt(n))) + 1): if n % i == 0: return False return n >= 2 def unique(iterable): # Should not be necessary in my opinion seen = set() for x in iterable: if x not in seen: seen.add(x) yield x n = int(input()) for _ in range(n): x = input() print(sum(is_prime(int("".join(y))) for len_ in range(1, len(x) + 1) for y in unique(it.permutations(x, len_)) if y[0] != '0')) On Fri, Oct 11, 2013 at 5:35 PM, Neil Girdhar wrote: > > Moreover, I don't think the runtime behavior of my one-liner is > particularly costly? > > It is *extremely* costly. There can be n! permutations, so for even, say, > 12 elements, you are looking at many gigabytes of memory needlessly used. > One big motivator for itertools is not to have to do this. I'm curious > how you would solve this problem: > https://www.kattis.com/problems/industrialspy efficiently in Python. I > did it by using a unique-ifying generator, but ideally this would not be > necessary. Ideally, Python would do exactly what C++ does with > next_permutation. > > Best, > > Neil > > > On Fri, Oct 11, 2013 at 4:27 PM, David Mertz wrote: > >> Andrew & Neil (or whoever): >> >> Is this *really* what you want: >> >> >>> from itertools import permutations >> >>> def nonredundant_permutations(seq): >> ... return list(set(permutations(seq))) >> ... >> >>> pprint(list(permutations([F(3,1), D(3.0), 3.0]))) >> [(Fraction(3, 1), Decimal('3'), 3.0), >> (Fraction(3, 1), 3.0, Decimal('3')), >> (Decimal('3'), Fraction(3, 1), 3.0), >> (Decimal('3'), 3.0, Fraction(3, 1)), >> (3.0, Fraction(3, 1), Decimal('3')), >> (3.0, Decimal('3'), Fraction(3, 1))] >> >> >>> pprint(list(nonredundant_permutations([F(3,1), D(3.0), 3.0]))) >> [(Fraction(3, 1), Decimal('3'), 3.0)] >> >> It seems odd to me to want that. On the other hand, I provide a one-line >> implementation of the desired behavior if anyone wants it. Moreover, I >> don't think the runtime behavior of my one-liner is particularly costly... >> maybe not the best possible, but the best big-O possible. >> >> >> >> On Fri, Oct 11, 2013 at 1:19 PM, Andrew Barnert wrote: >> >>> I think equality is perfectly reasonable here. The fact that {3.0, 3} >>> only has one member seems like the obvious precedent to follow here. >>> >>> Sent from a random iPhone >>> >>> On Oct 11, 2013, at 13:02, David Mertz wrote: >>> >>> What would you like this hypothetical function to output here: >>> >>> >>> from itertools import permutations >>> >>> from decimal import Decimal as D >>> >>> from fractions import Fraction as F >>> >>> items = (3, 3.0, D(3), F(3,1), "aa", "AA".lower(), "a"+"a") >>> >>> list(permutations(items)) >>> >>> It's neither QUITE equality nor identity you are looking for, I think, >>> in nonredundant_permutation(): >>> >>> >> "aa" == "AA".lower(), "aa" is "AA".lower() >>> (True, False) >>> >>> "aa" == "a"+"a", "aa" is "a"+"a" >>> (True, True) >>> >>> D(3) == 3.0, D(3) is 3.0 >>> (True, False) >>> >>> On Fri, Oct 11, 2013 at 11:38 AM, Neil Girdhar wrote: >>> >>>> "It is universally agreed that a list of n distinct symbols has n! >>>> permutations. 
However, when the symbols are not distinct, the most common >>>> convention, in mathematics and elsewhere, seems to be to count only >>>> distinct permutations." ? >>>> http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original >>>> . >>>> >>>> >>>> Should we consider fixing itertools.permutations and to output only >>>> unique permutations (if possible, although I realize that would break >>>> code). It is completely non-obvious to have permutations returning >>>> duplicates. For a non-breaking compromise what about adding a flag? >>>> >>>> Best, >>>> Neil >>>> >>>> _______________________________________________ >>>> Python-ideas mailing list >>>> Python-ideas at python.org >>>> https://mail.python.org/mailman/listinfo/python-ideas >>>> >>>> >>> >>> >>> -- >>> Keeping medicines from the bloodstreams of the sick; food >>> from the bellies of the hungry; books from the hands of the >>> uneducated; technology from the underdeveloped; and putting >>> advocates of freedom in prisons. Intellectual property is >>> to the 21st century what the slave trade was to the 16th. >>> >>> >>> >>> -- >>> Keeping medicines from the bloodstreams of the sick; food >>> from the bellies of the hungry; books from the hands of the >>> uneducated; technology from the underdeveloped; and putting >>> advocates of freedom in prisons. Intellectual property is >>> to the 21st century what the slave trade was to the 16th. >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> >>> >> >> >> -- >> Keeping medicines from the bloodstreams of the sick; food >> from the bellies of the hungry; books from the hands of the >> uneducated; technology from the underdeveloped; and putting >> advocates of freedom in prisons. Intellectual property is >> to the 21st century what the slave trade was to the 16th. >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> >> -- >> >> --- >> You received this message because you are subscribed to a topic in the >> Google Groups "python-ideas" group. >> To unsubscribe from this topic, visit >> https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. >> To unsubscribe from this group and all its topics, send an email to >> python-ideas+unsubscribe at googlegroups.com. >> For more options, visit https://groups.google.com/groups/opt_out. >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Fri Oct 11 20:50:00 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Fri, 11 Oct 2013 11:50:00 -0700 (PDT) Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> Message-ID: <26ae5ce9-6709-41a1-9dc2-6f8d5bc2f0bd@googlegroups.com> Note that if permutations is made to return only unique permutations, the behaviour of defining unique elements by index can be recovered using: ([it[index] for index in indexes] for indexes in itertools.permutations(range(len(it)))) On Friday, October 11, 2013 2:38:33 PM UTC-4, Neil Girdhar wrote: > > "It is universally agreed that a list of n distinct symbols has n! > permutations. 
However, when the symbols are not distinct, the most common > convention, in mathematics and elsewhere, seems to be to count only > distinct permutations." ? > http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original > . > > > Should we consider fixing itertools.permutations and to output only unique > permutations (if possible, although I realize that would break code). It is > completely non-obvious to have permutations returning duplicates. For a > non-breaking compromise what about adding a flag? > > Best, > Neil > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mertz at gnosis.cx Fri Oct 11 23:48:25 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 14:48:25 -0700 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: OK, you're right. Just using set() has bad worst case memory costs. I was thinking of the case where there actually WERE lots of equalities, and hence the resulting list would be much smaller than N!. But of course that's not general. It takes more than one line, but here's an incremental version: def nonredundant_permutations(seq): seq = sorted(seq) last = None for perm in permutations(seq): if perm != last: yield perm last = perm On Fri, Oct 11, 2013 at 2:35 PM, Neil Girdhar wrote: > > Moreover, I don't think the runtime behavior of my one-liner is > particularly costly? > > It is *extremely* costly. There can be n! permutations, so for even, say, > 12 elements, you are looking at many gigabytes of memory needlessly used. > One big motivator for itertools is not to have to do this. I'm curious > how you would solve this problem: > https://www.kattis.com/problems/industrialspy efficiently in Python. I > did it by using a unique-ifying generator, but ideally this would not be > necessary. Ideally, Python would do exactly what C++ does with > next_permutation. > > Best, > > Neil > > > On Fri, Oct 11, 2013 at 4:27 PM, David Mertz wrote: > >> Andrew & Neil (or whoever): >> >> Is this *really* what you want: >> >> >>> from itertools import permutations >> >>> def nonredundant_permutations(seq): >> ... return list(set(permutations(seq))) >> ... >> >>> pprint(list(permutations([F(3,1), D(3.0), 3.0]))) >> [(Fraction(3, 1), Decimal('3'), 3.0), >> (Fraction(3, 1), 3.0, Decimal('3')), >> (Decimal('3'), Fraction(3, 1), 3.0), >> (Decimal('3'), 3.0, Fraction(3, 1)), >> (3.0, Fraction(3, 1), Decimal('3')), >> (3.0, Decimal('3'), Fraction(3, 1))] >> >> >>> pprint(list(nonredundant_permutations([F(3,1), D(3.0), 3.0]))) >> [(Fraction(3, 1), Decimal('3'), 3.0)] >> >> It seems odd to me to want that. On the other hand, I provide a one-line >> implementation of the desired behavior if anyone wants it. Moreover, I >> don't think the runtime behavior of my one-liner is particularly costly... >> maybe not the best possible, but the best big-O possible. >> >> >> >> On Fri, Oct 11, 2013 at 1:19 PM, Andrew Barnert wrote: >> >>> I think equality is perfectly reasonable here. The fact that {3.0, 3} >>> only has one member seems like the obvious precedent to follow here. 
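To make the precedent Andrew is pointing at concrete: values of different numeric types that compare equal collapse to a single member in a set, and the member that survives is simply whichever was added first. Any set()-based deduplication of permutations inherits exactly that behaviour. A quick check in the interpreter, using nothing beyond the standard library:

>>> from decimal import Decimal
>>> from fractions import Fraction
>>> {3.0, 3}
{3.0}
>>> {3, 3.0, Decimal('3'), Fraction(3, 1)}
{3}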
>>> >>> Sent from a random iPhone >>> >>> On Oct 11, 2013, at 13:02, David Mertz wrote: >>> >>> What would you like this hypothetical function to output here: >>> >>> >>> from itertools import permutations >>> >>> from decimal import Decimal as D >>> >>> from fractions import Fraction as F >>> >>> items = (3, 3.0, D(3), F(3,1), "aa", "AA".lower(), "a"+"a") >>> >>> list(permutations(items)) >>> >>> It's neither QUITE equality nor identity you are looking for, I think, >>> in nonredundant_permutation(): >>> >>> >> "aa" == "AA".lower(), "aa" is "AA".lower() >>> (True, False) >>> >>> "aa" == "a"+"a", "aa" is "a"+"a" >>> (True, True) >>> >>> D(3) == 3.0, D(3) is 3.0 >>> (True, False) >>> >>> On Fri, Oct 11, 2013 at 11:38 AM, Neil Girdhar wrote: >>> >>>> "It is universally agreed that a list of n distinct symbols has n! >>>> permutations. However, when the symbols are not distinct, the most common >>>> convention, in mathematics and elsewhere, seems to be to count only >>>> distinct permutations." ? >>>> http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original >>>> . >>>> >>>> >>>> Should we consider fixing itertools.permutations and to output only >>>> unique permutations (if possible, although I realize that would break >>>> code). It is completely non-obvious to have permutations returning >>>> duplicates. For a non-breaking compromise what about adding a flag? >>>> >>>> Best, >>>> Neil >>>> >>>> _______________________________________________ >>>> Python-ideas mailing list >>>> Python-ideas at python.org >>>> https://mail.python.org/mailman/listinfo/python-ideas >>>> >>>> >>> >>> >>> -- >>> Keeping medicines from the bloodstreams of the sick; food >>> from the bellies of the hungry; books from the hands of the >>> uneducated; technology from the underdeveloped; and putting >>> advocates of freedom in prisons. Intellectual property is >>> to the 21st century what the slave trade was to the 16th. >>> >>> >>> >>> -- >>> Keeping medicines from the bloodstreams of the sick; food >>> from the bellies of the hungry; books from the hands of the >>> uneducated; technology from the underdeveloped; and putting >>> advocates of freedom in prisons. Intellectual property is >>> to the 21st century what the slave trade was to the 16th. >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> >>> >> >> >> -- >> Keeping medicines from the bloodstreams of the sick; food >> from the bellies of the hungry; books from the hands of the >> uneducated; technology from the underdeveloped; and putting >> advocates of freedom in prisons. Intellectual property is >> to the 21st century what the slave trade was to the 16th. >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> >> -- >> >> --- >> You received this message because you are subscribed to a topic in the >> Google Groups "python-ideas" group. >> To unsubscribe from this topic, visit >> https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. >> To unsubscribe from this group and all its topics, send an email to >> python-ideas+unsubscribe at googlegroups.com. >> For more options, visit https://groups.google.com/groups/opt_out. 
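Going back to the unique() helper at the top of this exchange: it is essentially the unique_everseen recipe from the itertools documentation, and it is the lazy way to filter duplicates out of permutations() when the elements are hashable. The catch the thread keeps circling around is the memory cost: the seen set can grow as large as the number of distinct results, which in the worst case approaches n!. A keyed sketch, assuming nothing beyond the standard library:

from itertools import permutations

def unique_everseen(iterable, key=None):
    # Lazily yield items, skipping any whose key has already been seen.
    # 'seen' grows with the number of *distinct* items yielded, which for
    # permutations of mostly-distinct elements can still approach n!.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

Used as ["".join(p) for p in unique_everseen(permutations('aaabb', 3))] it should give the seven distinct triples ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] in first-seen order, without building the full 60-element list first.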
>> >> > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Fri Oct 11 23:51:06 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Fri, 11 Oct 2013 17:51:06 -0400 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: Unfortunately, that doesn't quite work? list("".join(x) for x in it.permutations('aaabb', 3)) ['aaa', 'aab', 'aab', 'aaa', 'aab', 'aab', 'aba', 'aba', 'abb', 'aba', 'aba', 'abb', 'aaa', 'aab', 'aab', 'aaa', 'aab', 'aab', 'aba', 'aba', 'abb', 'aba', 'aba', 'abb', 'aaa', 'aab', 'aab', 'aaa', 'aab', 'aab', 'aba', 'aba', 'abb', 'aba', 'aba', 'abb', 'baa', 'baa', 'bab', 'baa', 'baa', 'bab', 'baa', 'baa', 'bab', 'bba', 'bba', 'bba', 'baa', 'baa', 'bab', 'baa', 'baa', 'bab', 'baa', 'baa', 'bab', 'bba', 'bba', 'bba'] On Fri, Oct 11, 2013 at 5:48 PM, David Mertz wrote: > OK, you're right. Just using set() has bad worst case memory costs. I > was thinking of the case where there actually WERE lots of equalities, and > hence the resulting list would be much smaller than N!. But of course > that's not general. It takes more than one line, but here's an incremental > version: > > def nonredundant_permutations(seq): > seq = sorted(seq) > last = None > for perm in permutations(seq): > if perm != last: > yield perm > last = perm > > > On Fri, Oct 11, 2013 at 2:35 PM, Neil Girdhar wrote: > >> > Moreover, I don't think the runtime behavior of my one-liner is >> particularly costly? >> >> It is *extremely* costly. There can be n! permutations, so for even, >> say, 12 elements, you are looking at many gigabytes of memory needlessly >> used. One big motivator for itertools is not to have to do this. I'm >> curious how you would solve this problem: >> https://www.kattis.com/problems/industrialspy efficiently in Python. I >> did it by using a unique-ifying generator, but ideally this would not be >> necessary. Ideally, Python would do exactly what C++ does with >> next_permutation. >> >> Best, >> >> Neil >> >> >> On Fri, Oct 11, 2013 at 4:27 PM, David Mertz wrote: >> >>> Andrew & Neil (or whoever): >>> >>> Is this *really* what you want: >>> >>> >>> from itertools import permutations >>> >>> def nonredundant_permutations(seq): >>> ... return list(set(permutations(seq))) >>> ... >>> >>> pprint(list(permutations([F(3,1), D(3.0), 3.0]))) >>> [(Fraction(3, 1), Decimal('3'), 3.0), >>> (Fraction(3, 1), 3.0, Decimal('3')), >>> (Decimal('3'), Fraction(3, 1), 3.0), >>> (Decimal('3'), 3.0, Fraction(3, 1)), >>> (3.0, Fraction(3, 1), Decimal('3')), >>> (3.0, Decimal('3'), Fraction(3, 1))] >>> >>> >>> pprint(list(nonredundant_permutations([F(3,1), D(3.0), 3.0]))) >>> [(Fraction(3, 1), Decimal('3'), 3.0)] >>> >>> It seems odd to me to want that. On the other hand, I provide a >>> one-line implementation of the desired behavior if anyone wants it. >>> Moreover, I don't think the runtime behavior of my one-liner is >>> particularly costly... 
maybe not the best possible, but the best big-O >>> possible. >>> >>> >>> >>> On Fri, Oct 11, 2013 at 1:19 PM, Andrew Barnert wrote: >>> >>>> I think equality is perfectly reasonable here. The fact that {3.0, 3} >>>> only has one member seems like the obvious precedent to follow here. >>>> >>>> Sent from a random iPhone >>>> >>>> On Oct 11, 2013, at 13:02, David Mertz wrote: >>>> >>>> What would you like this hypothetical function to output here: >>>> >>>> >>> from itertools import permutations >>>> >>> from decimal import Decimal as D >>>> >>> from fractions import Fraction as F >>>> >>> items = (3, 3.0, D(3), F(3,1), "aa", "AA".lower(), "a"+"a") >>>> >>> list(permutations(items)) >>>> >>>> It's neither QUITE equality nor identity you are looking for, I think, >>>> in nonredundant_permutation(): >>>> >>>> >> "aa" == "AA".lower(), "aa" is "AA".lower() >>>> (True, False) >>>> >>> "aa" == "a"+"a", "aa" is "a"+"a" >>>> (True, True) >>>> >>> D(3) == 3.0, D(3) is 3.0 >>>> (True, False) >>>> >>>> On Fri, Oct 11, 2013 at 11:38 AM, Neil Girdhar wrote: >>>> >>>>> "It is universally agreed that a list of n distinct symbols has n! >>>>> permutations. However, when the symbols are not distinct, the most common >>>>> convention, in mathematics and elsewhere, seems to be to count only >>>>> distinct permutations." ? >>>>> http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original >>>>> . >>>>> >>>>> >>>>> Should we consider fixing itertools.permutations and to output only >>>>> unique permutations (if possible, although I realize that would break >>>>> code). It is completely non-obvious to have permutations returning >>>>> duplicates. For a non-breaking compromise what about adding a flag? >>>>> >>>>> Best, >>>>> Neil >>>>> >>>>> _______________________________________________ >>>>> Python-ideas mailing list >>>>> Python-ideas at python.org >>>>> https://mail.python.org/mailman/listinfo/python-ideas >>>>> >>>>> >>>> >>>> >>>> -- >>>> Keeping medicines from the bloodstreams of the sick; food >>>> from the bellies of the hungry; books from the hands of the >>>> uneducated; technology from the underdeveloped; and putting >>>> advocates of freedom in prisons. Intellectual property is >>>> to the 21st century what the slave trade was to the 16th. >>>> >>>> >>>> >>>> -- >>>> Keeping medicines from the bloodstreams of the sick; food >>>> from the bellies of the hungry; books from the hands of the >>>> uneducated; technology from the underdeveloped; and putting >>>> advocates of freedom in prisons. Intellectual property is >>>> to the 21st century what the slave trade was to the 16th. >>>> >>>> _______________________________________________ >>>> Python-ideas mailing list >>>> Python-ideas at python.org >>>> https://mail.python.org/mailman/listinfo/python-ideas >>>> >>>> >>> >>> >>> -- >>> Keeping medicines from the bloodstreams of the sick; food >>> from the bellies of the hungry; books from the hands of the >>> uneducated; technology from the underdeveloped; and putting >>> advocates of freedom in prisons. Intellectual property is >>> to the 21st century what the slave trade was to the 16th. >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> >>> -- >>> >>> --- >>> You received this message because you are subscribed to a topic in the >>> Google Groups "python-ideas" group. 
>>> To unsubscribe from this topic, visit >>> https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. >>> To unsubscribe from this group and all its topics, send an email to >>> python-ideas+unsubscribe at googlegroups.com. >>> For more options, visit https://groups.google.com/groups/opt_out. >>> >>> >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> >> > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mertz at gnosis.cx Sat Oct 12 00:03:48 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 15:03:48 -0700 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: Bummer. You are right, Neil. I saw MRAB's suggestion about sorting, and falsely thought that would be general; but obviously it's not. So I guess the question is whether there is ANY way to do this without having to accumulate a 'seen' set (which can grow to size N!). The answer isn't jumping out at me, but that doesn't mean there's not a way. I don't want itertools.permutations() to do "equality filtering", but assuming some other function in itertools were to do that, how could it do so algorithmically? Or whatever, same question if it is itertools.permutations(seq, distinct=True) as the API. On Fri, Oct 11, 2013 at 2:51 PM, Neil Girdhar wrote: > Unfortunately, that doesn't quite work? > > list("".join(x) for x in it.permutations('aaabb', 3)) > ['aaa', 'aab', 'aab', 'aaa', 'aab', 'aab', 'aba', 'aba', 'abb', 'aba', > 'aba', 'abb', 'aaa', 'aab', 'aab', 'aaa', 'aab', 'aab', 'aba', 'aba', > 'abb', 'aba', 'aba', 'abb', 'aaa', 'aab', 'aab', 'aaa', 'aab', 'aab', > 'aba', 'aba', 'abb', 'aba', 'aba', 'abb', 'baa', 'baa', 'bab', 'baa', > 'baa', 'bab', 'baa', 'baa', 'bab', 'bba', 'bba', 'bba', 'baa', 'baa', > 'bab', 'baa', 'baa', 'bab', 'baa', 'baa', 'bab', 'bba', 'bba', 'bba'] > > > On Fri, Oct 11, 2013 at 5:48 PM, David Mertz wrote: > >> OK, you're right. Just using set() has bad worst case memory costs. I >> was thinking of the case where there actually WERE lots of equalities, and >> hence the resulting list would be much smaller than N!. But of course >> that's not general. It takes more than one line, but here's an incremental >> version: >> >> def nonredundant_permutations(seq): >> seq = sorted(seq) >> last = None >> for perm in permutations(seq): >> if perm != last: >> yield perm >> last = perm >> >> >> On Fri, Oct 11, 2013 at 2:35 PM, Neil Girdhar wrote: >> >>> > Moreover, I don't think the runtime behavior of my one-liner is >>> particularly costly? >>> >>> It is *extremely* costly. There can be n! permutations, so for even, >>> say, 12 elements, you are looking at many gigabytes of memory needlessly >>> used. One big motivator for itertools is not to have to do this. I'm >>> curious how you would solve this problem: >>> https://www.kattis.com/problems/industrialspy efficiently in Python. >>> I did it by using a unique-ifying generator, but ideally this would not be >>> necessary. 
Ideally, Python would do exactly what C++ does with >>> next_permutation. >>> >>> Best, >>> >>> Neil >>> >>> >>> On Fri, Oct 11, 2013 at 4:27 PM, David Mertz wrote: >>> >>>> Andrew & Neil (or whoever): >>>> >>>> Is this *really* what you want: >>>> >>>> >>> from itertools import permutations >>>> >>> def nonredundant_permutations(seq): >>>> ... return list(set(permutations(seq))) >>>> ... >>>> >>> pprint(list(permutations([F(3,1), D(3.0), 3.0]))) >>>> [(Fraction(3, 1), Decimal('3'), 3.0), >>>> (Fraction(3, 1), 3.0, Decimal('3')), >>>> (Decimal('3'), Fraction(3, 1), 3.0), >>>> (Decimal('3'), 3.0, Fraction(3, 1)), >>>> (3.0, Fraction(3, 1), Decimal('3')), >>>> (3.0, Decimal('3'), Fraction(3, 1))] >>>> >>>> >>> pprint(list(nonredundant_permutations([F(3,1), D(3.0), 3.0]))) >>>> [(Fraction(3, 1), Decimal('3'), 3.0)] >>>> >>>> It seems odd to me to want that. On the other hand, I provide a >>>> one-line implementation of the desired behavior if anyone wants it. >>>> Moreover, I don't think the runtime behavior of my one-liner is >>>> particularly costly... maybe not the best possible, but the best big-O >>>> possible. >>>> >>>> >>>> >>>> On Fri, Oct 11, 2013 at 1:19 PM, Andrew Barnert wrote: >>>> >>>>> I think equality is perfectly reasonable here. The fact that {3.0, 3} >>>>> only has one member seems like the obvious precedent to follow here. >>>>> >>>>> Sent from a random iPhone >>>>> >>>>> On Oct 11, 2013, at 13:02, David Mertz wrote: >>>>> >>>>> What would you like this hypothetical function to output here: >>>>> >>>>> >>> from itertools import permutations >>>>> >>> from decimal import Decimal as D >>>>> >>> from fractions import Fraction as F >>>>> >>> items = (3, 3.0, D(3), F(3,1), "aa", "AA".lower(), "a"+"a") >>>>> >>> list(permutations(items)) >>>>> >>>>> It's neither QUITE equality nor identity you are looking for, I think, >>>>> in nonredundant_permutation(): >>>>> >>>>> >> "aa" == "AA".lower(), "aa" is "AA".lower() >>>>> (True, False) >>>>> >>> "aa" == "a"+"a", "aa" is "a"+"a" >>>>> (True, True) >>>>> >>> D(3) == 3.0, D(3) is 3.0 >>>>> (True, False) >>>>> >>>>> On Fri, Oct 11, 2013 at 11:38 AM, Neil Girdhar wrote: >>>>> >>>>>> "It is universally agreed that a list of n distinct symbols has n! >>>>>> permutations. However, when the symbols are not distinct, the most common >>>>>> convention, in mathematics and elsewhere, seems to be to count only >>>>>> distinct permutations." ? >>>>>> http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original >>>>>> . >>>>>> >>>>>> >>>>>> Should we consider fixing itertools.permutations and to output only >>>>>> unique permutations (if possible, although I realize that would break >>>>>> code). It is completely non-obvious to have permutations returning >>>>>> duplicates. For a non-breaking compromise what about adding a flag? >>>>>> >>>>>> Best, >>>>>> Neil >>>>>> >>>>>> _______________________________________________ >>>>>> Python-ideas mailing list >>>>>> Python-ideas at python.org >>>>>> https://mail.python.org/mailman/listinfo/python-ideas >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Keeping medicines from the bloodstreams of the sick; food >>>>> from the bellies of the hungry; books from the hands of the >>>>> uneducated; technology from the underdeveloped; and putting >>>>> advocates of freedom in prisons. Intellectual property is >>>>> to the 21st century what the slave trade was to the 16th. 
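For reference, the std::next_permutation behaviour Neil keeps pointing to (stepping a sorted, mutually comparable sequence through each distinct arrangement in lexicographic order, using O(1) extra space) is easy to sketch in pure Python. This is the classic in-place algorithm, not anything that exists in the stdlib:

def next_permutation(seq):
    # Rearrange the list 'seq' in place into the next permutation in
    # lexicographic order.  Returns False if 'seq' was already the last
    # (descending) permutation; unlike the C++ version, this sketch then
    # leaves 'seq' untouched rather than resetting it to sorted order.
    # Because the comparisons are strict, repeated elements are handled
    # the way C++ handles them: each *distinct* arrangement comes up once.
    i = len(seq) - 2
    while i >= 0 and not (seq[i] < seq[i + 1]):
        i -= 1
    if i < 0:
        return False
    j = len(seq) - 1
    while not (seq[i] < seq[j]):
        j -= 1
    seq[i], seq[j] = seq[j], seq[i]
    seq[i + 1:] = reversed(seq[i + 1:])
    return True

Starting from sorted input and looping until it returns False visits each distinct permutation exactly once; for example, items = sorted('aab') steps through 'aab', 'aba', 'baa'.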
>>>>> >>>>> >>>>> >>>>> -- >>>>> Keeping medicines from the bloodstreams of the sick; food >>>>> from the bellies of the hungry; books from the hands of the >>>>> uneducated; technology from the underdeveloped; and putting >>>>> advocates of freedom in prisons. Intellectual property is >>>>> to the 21st century what the slave trade was to the 16th. >>>>> >>>>> _______________________________________________ >>>>> Python-ideas mailing list >>>>> Python-ideas at python.org >>>>> https://mail.python.org/mailman/listinfo/python-ideas >>>>> >>>>> >>>> >>>> >>>> -- >>>> Keeping medicines from the bloodstreams of the sick; food >>>> from the bellies of the hungry; books from the hands of the >>>> uneducated; technology from the underdeveloped; and putting >>>> advocates of freedom in prisons. Intellectual property is >>>> to the 21st century what the slave trade was to the 16th. >>>> >>>> _______________________________________________ >>>> Python-ideas mailing list >>>> Python-ideas at python.org >>>> https://mail.python.org/mailman/listinfo/python-ideas >>>> >>>> -- >>>> >>>> --- >>>> You received this message because you are subscribed to a topic in the >>>> Google Groups "python-ideas" group. >>>> To unsubscribe from this topic, visit >>>> https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. >>>> To unsubscribe from this group and all its topics, send an email to >>>> python-ideas+unsubscribe at googlegroups.com. >>>> For more options, visit https://groups.google.com/groups/opt_out. >>>> >>>> >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> >>> >> >> >> -- >> Keeping medicines from the bloodstreams of the sick; food >> from the bellies of the hungry; books from the hands of the >> uneducated; technology from the underdeveloped; and putting >> advocates of freedom in prisons. Intellectual property is >> to the 21st century what the slave trade was to the 16th. >> > > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From python at mrabarnett.plus.com Sat Oct 12 00:19:34 2013 From: python at mrabarnett.plus.com (MRAB) Date: Fri, 11 Oct 2013 23:19:34 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: <52587976.1000901@mrabarnett.plus.com> On 11/10/2013 23:03, David Mertz wrote: > Bummer. You are right, Neil. I saw MRAB's suggestion about sorting, > and falsely thought that would be general; but obviously it's not. > > So I guess the question is whether there is ANY way to do this without > having to accumulate a 'seen' set (which can grow to size N!). The > answer isn't jumping out at me, but that doesn't mean there's not a way. > > I don't want itertools.permutations() to do "equality filtering", but > assuming some other function in itertools were to do that, how could it > do so algorithmically? Or whatever, same question if it is > itertools.permutations(seq, distinct=True) as the API. 
> Here's an implementation: def unique_permutations(iterable, count=None, key=None): def perm(items, count): if count: prev_item = object() for i, item in enumerate(items): if item != prev_item: for p in perm(items[ : i] + items[i + 1 : ], count - 1): yield [item] + p prev_item = item else: yield [] if key is None: key = lambda item: item items = sorted(iterable, key=key) if count is None: count = len(items) yield from perm(items, count) And some results: >>> print(list("".join(x) for x in unique_permutations('aaabb', 3))) ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] >>> print(list(unique_permutations([0, 'a', 0], key=str))) [[0, 0, 'a'], [0, 'a', 0], ['a', 0, 0]] From mistersheik at gmail.com Sat Oct 12 00:23:36 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Fri, 11 Oct 2013 18:23:36 -0400 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: <52587976.1000901@mrabarnett.plus.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> Message-ID: Beautiful!! On Fri, Oct 11, 2013 at 6:19 PM, MRAB wrote: > On 11/10/2013 23:03, David Mertz wrote: > >> Bummer. You are right, Neil. I saw MRAB's suggestion about sorting, >> and falsely thought that would be general; but obviously it's not. >> >> So I guess the question is whether there is ANY way to do this without >> having to accumulate a 'seen' set (which can grow to size N!). The >> answer isn't jumping out at me, but that doesn't mean there's not a way. >> >> I don't want itertools.permutations() to do "equality filtering", but >> assuming some other function in itertools were to do that, how could it >> do so algorithmically? Or whatever, same question if it is >> itertools.permutations(seq, distinct=True) as the API. >> >> Here's an implementation: > > def unique_permutations(iterable, count=None, key=None): > def perm(items, count): > if count: > prev_item = object() > > for i, item in enumerate(items): > if item != prev_item: > for p in perm(items[ : i] + items[i + 1 : ], count - > 1): > yield [item] + p > > prev_item = item > > else: > yield [] > > if key is None: > key = lambda item: item > > items = sorted(iterable, key=key) > > if count is None: > count = len(items) > > yield from perm(items, count) > > > And some results: > > >>> print(list("".join(x) for x in unique_permutations('aaabb', 3))) > ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] > >>> print(list(unique_**permutations([0, 'a', 0], key=str))) > [[0, 0, 'a'], [0, 'a', 0], ['a', 0, 0]] > > > ______________________________**_________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/**mailman/listinfo/python-ideas > > -- > > --- You received this message because you are subscribed to a topic in the > Google Groups "python-ideas" group. > To unsubscribe from this topic, visit https://groups.google.com/d/** > topic/python-ideas/**dDttJfkyu2k/unsubscribe > . > To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe@**googlegroups.com > . > For more options, visit https://groups.google.com/**groups/opt_out > . > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mertz at gnosis.cx Sat Oct 12 00:45:04 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 15:45:04 -0700 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> Message-ID: I realize after reading http://stackoverflow.com/questions/6284396/permutations-with-unique-valuesthat my version was ALMOST right: def nonredundant_permutations(seq, r=None): last = () for perm in permutations(sorted(seq), r): if perm > last: yield perm last = perm I can't look only for inequality, but must use the actual comparison. >>> ["".join(x) for x in nonredundant_permutations('aaabb',3)] ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] >>> list(nonredundant_permutations([F(3,1), D(3.0), 3.0])) [(Fraction(3, 1), Decimal('3'), 3.0)] Of course, this approach DOES rely on the order in which itertools.permutations() returns values. However, it's a bit more compact than MRAB's version. On Fri, Oct 11, 2013 at 3:23 PM, Neil Girdhar wrote: > Beautiful!! > > > On Fri, Oct 11, 2013 at 6:19 PM, MRAB wrote: > >> On 11/10/2013 23:03, David Mertz wrote: >> >>> Bummer. You are right, Neil. I saw MRAB's suggestion about sorting, >>> and falsely thought that would be general; but obviously it's not. >>> >>> So I guess the question is whether there is ANY way to do this without >>> having to accumulate a 'seen' set (which can grow to size N!). The >>> answer isn't jumping out at me, but that doesn't mean there's not a way. >>> >>> I don't want itertools.permutations() to do "equality filtering", but >>> assuming some other function in itertools were to do that, how could it >>> do so algorithmically? Or whatever, same question if it is >>> itertools.permutations(seq, distinct=True) as the API. >>> >>> Here's an implementation: >> >> def unique_permutations(iterable, count=None, key=None): >> def perm(items, count): >> if count: >> prev_item = object() >> >> for i, item in enumerate(items): >> if item != prev_item: >> for p in perm(items[ : i] + items[i + 1 : ], count - >> 1): >> yield [item] + p >> >> prev_item = item >> >> else: >> yield [] >> >> if key is None: >> key = lambda item: item >> >> items = sorted(iterable, key=key) >> >> if count is None: >> count = len(items) >> >> yield from perm(items, count) >> >> >> And some results: >> >> >>> print(list("".join(x) for x in unique_permutations('aaabb', 3))) >> ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] >> >>> print(list(unique_**permutations([0, 'a', 0], key=str))) >> [[0, 0, 'a'], [0, 'a', 0], ['a', 0, 0]] >> >> >> ______________________________**_________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/**mailman/listinfo/python-ideas >> >> -- >> >> --- You received this message because you are subscribed to a topic in >> the Google Groups "python-ideas" group. >> To unsubscribe from this topic, visit https://groups.google.com/d/** >> topic/python-ideas/**dDttJfkyu2k/unsubscribe >> . >> To unsubscribe from this group and all its topics, send an email to >> python-ideas+unsubscribe@**googlegroups.com >> . >> For more options, visit https://groups.google.com/**groups/opt_out >> . 
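A note on why the strict '>' comparison succeeds where the earlier '!=' check did not: with r < len(seq), equal r-tuples are not adjacent in the output (Neil's 'aaabb' listing shows 'aaa' turning up in six separate places), so comparing against only the previous tuple misses duplicates. Keeping only tuples strictly greater than the last one kept instead relies on the input being sorted, so that each distinct tuple's first appearance comes after every smaller tuple has already shown up; that is the property the results above depend on. A quick sanity check of that behaviour, with the filter pulled out on its own (keep_records is just a helper name for this check):

from itertools import permutations

def keep_records(perms):
    # The filter from the generator above: yield a permutation only if it
    # is strictly greater than the last one yielded.
    last = ()
    for p in perms:
        if p > last:
            yield p
            last = p

filtered = list(keep_records(permutations(sorted('aaabb'), 3)))
assert set(filtered) == set(permutations('aaabb', 3))   # nothing missed
assert len(filtered) == len(set(filtered))              # nothing repeated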
>> > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Sat Oct 12 01:49:55 2013 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 12 Oct 2013 09:49:55 +1000 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> Message-ID: On 12 Oct 2013 08:45, "David Mertz" wrote: > > > I realize after reading http://stackoverflow.com/questions/6284396/permutations-with-unique-valuesthat my version was ALMOST right: > > def nonredundant_permutations(seq, r=None): > last = () > for perm in permutations(sorted(seq), r): > if perm > last: > yield perm > last = perm > > I can't look only for inequality, but must use the actual comparison. > > >>> ["".join(x) for x in nonredundant_permutations('aaabb',3)] > ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] > >>> list(nonredundant_permutations([F(3,1), D(3.0), 3.0])) > [(Fraction(3, 1), Decimal('3'), 3.0)] > > Of course, this approach DOES rely on the order in which itertools.permutations() returns values. However, it's a bit more compact than MRAB's version. As there is no requirement that entries in a sequence handled by itertools.permutations be sortable, so the original question of why this isn't done by default has been answered (the general solution risks consuming too much memory, while the memory efficient solution constrains the domain to only sortable sequences). Cheers, Nick. > > > > > On Fri, Oct 11, 2013 at 3:23 PM, Neil Girdhar wrote: >> >> Beautiful!! >> >> >> On Fri, Oct 11, 2013 at 6:19 PM, MRAB wrote: >>> >>> On 11/10/2013 23:03, David Mertz wrote: >>>> >>>> Bummer. You are right, Neil. I saw MRAB's suggestion about sorting, >>>> and falsely thought that would be general; but obviously it's not. >>>> >>>> So I guess the question is whether there is ANY way to do this without >>>> having to accumulate a 'seen' set (which can grow to size N!). The >>>> answer isn't jumping out at me, but that doesn't mean there's not a way. >>>> >>>> I don't want itertools.permutations() to do "equality filtering", but >>>> assuming some other function in itertools were to do that, how could it >>>> do so algorithmically? Or whatever, same question if it is >>>> itertools.permutations(seq, distinct=True) as the API. 
>>>> >>> Here's an implementation: >>> >>> def unique_permutations(iterable, count=None, key=None): >>> def perm(items, count): >>> if count: >>> prev_item = object() >>> >>> for i, item in enumerate(items): >>> if item != prev_item: >>> for p in perm(items[ : i] + items[i + 1 : ], count - 1): >>> yield [item] + p >>> >>> prev_item = item >>> >>> else: >>> yield [] >>> >>> if key is None: >>> key = lambda item: item >>> >>> items = sorted(iterable, key=key) >>> >>> if count is None: >>> count = len(items) >>> >>> yield from perm(items, count) >>> >>> >>> And some results: >>> >>> >>> print(list("".join(x) for x in unique_permutations('aaabb', 3))) >>> ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] >>> >>> print(list(unique_permutations([0, 'a', 0], key=str))) >>> [[0, 0, 'a'], [0, 'a', 0], ['a', 0, 0]] >>> >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> >>> -- >>> >>> --- You received this message because you are subscribed to a topic in the Google Groups "python-ideas" group. >>> To unsubscribe from this topic, visit https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. >>> To unsubscribe from this group and all its topics, send an email to python-ideas+unsubscribe at googlegroups.com. >>> For more options, visit https://groups.google.com/groups/opt_out. >> >> >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> > > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Sat Oct 12 01:53:31 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Fri, 11 Oct 2013 19:53:31 -0400 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> Message-ID: Yes, that's all true. I want to suggest that the efficient unique permutations solution is very important to have. Sortable sequences are very common. There are itertools routines that only work with =-comparable elements (e.g. groupby), so it's not a stretch to have a permutations that is restricted to <-comparable elements. Best, Neil On Fri, Oct 11, 2013 at 7:49 PM, Nick Coghlan wrote: > > On 12 Oct 2013 08:45, "David Mertz" wrote: > > > > > > I realize after reading > http://stackoverflow.com/questions/6284396/permutations-with-unique-valuesthat my version was ALMOST right: > > > > def nonredundant_permutations(seq, r=None): > > last = () > > for perm in permutations(sorted(seq), r): > > if perm > last: > > yield perm > > last = perm > > > > I can't look only for inequality, but must use the actual comparison. 
> > > > >>> ["".join(x) for x in nonredundant_permutations('aaabb',3)] > > ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] > > >>> list(nonredundant_permutations([F(3,1), D(3.0), 3.0])) > > [(Fraction(3, 1), Decimal('3'), 3.0)] > > > > Of course, this approach DOES rely on the order in which > itertools.permutations() returns values. However, it's a bit more compact > than MRAB's version. > > As there is no requirement that entries in a sequence handled by > itertools.permutations be sortable, so the original question of why this > isn't done by default has been answered (the general solution risks > consuming too much memory, while the memory efficient solution constrains > the domain to only sortable sequences). > > Cheers, > Nick. > > > > > > > > > > > On Fri, Oct 11, 2013 at 3:23 PM, Neil Girdhar > wrote: > >> > >> Beautiful!! > >> > >> > >> On Fri, Oct 11, 2013 at 6:19 PM, MRAB > wrote: > >>> > >>> On 11/10/2013 23:03, David Mertz wrote: > >>>> > >>>> Bummer. You are right, Neil. I saw MRAB's suggestion about sorting, > >>>> and falsely thought that would be general; but obviously it's not. > >>>> > >>>> So I guess the question is whether there is ANY way to do this without > >>>> having to accumulate a 'seen' set (which can grow to size N!). The > >>>> answer isn't jumping out at me, but that doesn't mean there's not a > way. > >>>> > >>>> I don't want itertools.permutations() to do "equality filtering", but > >>>> assuming some other function in itertools were to do that, how could > it > >>>> do so algorithmically? Or whatever, same question if it is > >>>> itertools.permutations(seq, distinct=True) as the API. > >>>> > >>> Here's an implementation: > >>> > >>> def unique_permutations(iterable, count=None, key=None): > >>> def perm(items, count): > >>> if count: > >>> prev_item = object() > >>> > >>> for i, item in enumerate(items): > >>> if item != prev_item: > >>> for p in perm(items[ : i] + items[i + 1 : ], count > - 1): > >>> yield [item] + p > >>> > >>> prev_item = item > >>> > >>> else: > >>> yield [] > >>> > >>> if key is None: > >>> key = lambda item: item > >>> > >>> items = sorted(iterable, key=key) > >>> > >>> if count is None: > >>> count = len(items) > >>> > >>> yield from perm(items, count) > >>> > >>> > >>> And some results: > >>> > >>> >>> print(list("".join(x) for x in unique_permutations('aaabb', 3))) > >>> ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] > >>> >>> print(list(unique_permutations([0, 'a', 0], key=str))) > >>> [[0, 0, 'a'], [0, 'a', 0], ['a', 0, 0]] > >>> > >>> > >>> _______________________________________________ > >>> Python-ideas mailing list > >>> Python-ideas at python.org > >>> https://mail.python.org/mailman/listinfo/python-ideas > >>> > >>> -- > >>> > >>> --- You received this message because you are subscribed to a topic in > the Google Groups "python-ideas" group. > >>> To unsubscribe from this topic, visit > https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. > >>> To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe at googlegroups.com. > >>> For more options, visit https://groups.google.com/groups/opt_out. 
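One way to put numbers on why this matters for problems like the industrial-spy one: the number of raw permutations can dwarf the number of distinct ones, and any filter layered on top of itertools.permutations() still has to walk every raw tuple, while a recursive generator along the lines of MRAB's only branches into distinct choices at each position. A small illustration, pure standard library:

from itertools import permutations

raw = sum(1 for _ in permutations('aaaaaab'))   # walks all 7! = 5040 tuples
distinct = len(set(permutations('aaaaaab')))    # only 7 of them differ
print(raw, distinct)                            # 5040 7

With more repetition the gap grows factorially, which is why the thread keeps coming back to generating distinct permutations directly instead of filtering them after the fact.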
> >> > >> > >> > >> _______________________________________________ > >> Python-ideas mailing list > >> Python-ideas at python.org > >> https://mail.python.org/mailman/listinfo/python-ideas > >> > > > > > > > > -- > > Keeping medicines from the bloodstreams of the sick; food > > from the bellies of the hungry; books from the hands of the > > uneducated; technology from the underdeveloped; and putting > > advocates of freedom in prisons. Intellectual property is > > to the 21st century what the slave trade was to the 16th. > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abarnert at yahoo.com Sat Oct 12 02:20:17 2013 From: abarnert at yahoo.com (Andrew Barnert) Date: Fri, 11 Oct 2013 17:20:17 -0700 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> Message-ID: I think this is worth having even for 3.3 and 2.x, so I'd suggest sending a patch to more-itertools (https://github.com/erikrose/more-itertools) as well as here. Sent from a random iPhone On Oct 11, 2013, at 16:53, Neil Girdhar wrote: > Yes, that's all true. I want to suggest that the efficient unique permutations solution is very important to have. Sortable sequences are very common. There are itertools routines that only work with =-comparable elements (e.g. groupby), so it's not a stretch to have a permutations that is restricted to <-comparable elements. > > Best, > Neil > > > On Fri, Oct 11, 2013 at 7:49 PM, Nick Coghlan wrote: >> >> On 12 Oct 2013 08:45, "David Mertz" wrote: >> > >> > >> > I realize after reading http://stackoverflow.com/questions/6284396/permutations-with-unique-values that my version was ALMOST right: >> > >> > def nonredundant_permutations(seq, r=None): >> > last = () >> > for perm in permutations(sorted(seq), r): >> > if perm > last: >> > yield perm >> > last = perm >> > >> > I can't look only for inequality, but must use the actual comparison. >> > >> > >>> ["".join(x) for x in nonredundant_permutations('aaabb',3)] >> > ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] >> > >>> list(nonredundant_permutations([F(3,1), D(3.0), 3.0])) >> > [(Fraction(3, 1), Decimal('3'), 3.0)] >> > >> > Of course, this approach DOES rely on the order in which itertools.permutations() returns values. However, it's a bit more compact than MRAB's version. >> >> As there is no requirement that entries in a sequence handled by itertools.permutations be sortable, so the original question of why this isn't done by default has been answered (the general solution risks consuming too much memory, while the memory efficient solution constrains the domain to only sortable sequences). >> >> Cheers, >> Nick. >> >> > >> > >> > >> > >> > On Fri, Oct 11, 2013 at 3:23 PM, Neil Girdhar wrote: >> >> >> >> Beautiful!! >> >> >> >> >> >> On Fri, Oct 11, 2013 at 6:19 PM, MRAB wrote: >> >>> >> >>> On 11/10/2013 23:03, David Mertz wrote: >> >>>> >> >>>> Bummer. You are right, Neil. I saw MRAB's suggestion about sorting, >> >>>> and falsely thought that would be general; but obviously it's not. >> >>>> >> >>>> So I guess the question is whether there is ANY way to do this without >> >>>> having to accumulate a 'seen' set (which can grow to size N!). 
The >> >>>> answer isn't jumping out at me, but that doesn't mean there's not a way. >> >>>> >> >>>> I don't want itertools.permutations() to do "equality filtering", but >> >>>> assuming some other function in itertools were to do that, how could it >> >>>> do so algorithmically? Or whatever, same question if it is >> >>>> itertools.permutations(seq, distinct=True) as the API. >> >>>> >> >>> Here's an implementation: >> >>> >> >>> def unique_permutations(iterable, count=None, key=None): >> >>> def perm(items, count): >> >>> if count: >> >>> prev_item = object() >> >>> >> >>> for i, item in enumerate(items): >> >>> if item != prev_item: >> >>> for p in perm(items[ : i] + items[i + 1 : ], count - 1): >> >>> yield [item] + p >> >>> >> >>> prev_item = item >> >>> >> >>> else: >> >>> yield [] >> >>> >> >>> if key is None: >> >>> key = lambda item: item >> >>> >> >>> items = sorted(iterable, key=key) >> >>> >> >>> if count is None: >> >>> count = len(items) >> >>> >> >>> yield from perm(items, count) >> >>> >> >>> >> >>> And some results: >> >>> >> >>> >>> print(list("".join(x) for x in unique_permutations('aaabb', 3))) >> >>> ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] >> >>> >>> print(list(unique_permutations([0, 'a', 0], key=str))) >> >>> [[0, 0, 'a'], [0, 'a', 0], ['a', 0, 0]] >> >>> >> >>> >> >>> _______________________________________________ >> >>> Python-ideas mailing list >> >>> Python-ideas at python.org >> >>> https://mail.python.org/mailman/listinfo/python-ideas >> >>> >> >>> -- >> >>> >> >>> --- You received this message because you are subscribed to a topic in the Google Groups "python-ideas" group. >> >>> To unsubscribe from this topic, visit https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. >> >>> To unsubscribe from this group and all its topics, send an email to python-ideas+unsubscribe at googlegroups.com. >> >>> For more options, visit https://groups.google.com/groups/opt_out. >> >> >> >> >> >> >> >> _______________________________________________ >> >> Python-ideas mailing list >> >> Python-ideas at python.org >> >> https://mail.python.org/mailman/listinfo/python-ideas >> >> >> > >> > >> > >> > -- >> > Keeping medicines from the bloodstreams of the sick; food >> > from the bellies of the hungry; books from the hands of the >> > uneducated; technology from the underdeveloped; and putting >> > advocates of freedom in prisons. Intellectual property is >> > to the 21st century what the slave trade was to the 16th. >> > >> > _______________________________________________ >> > Python-ideas mailing list >> > Python-ideas at python.org >> > https://mail.python.org/mailman/listinfo/python-ideas >> > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From python at mrabarnett.plus.com Sat Oct 12 03:55:23 2013 From: python at mrabarnett.plus.com (MRAB) Date: Sat, 12 Oct 2013 02:55:23 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> Message-ID: <5258AC0B.1090603@mrabarnett.plus.com> On 12/10/2013 00:49, Nick Coghlan wrote: > > On 12 Oct 2013 08:45, "David Mertz" > wrote: > > > > > > I realize after reading > http://stackoverflow.com/questions/6284396/permutations-with-unique-values > that my version was ALMOST right: > > > > def nonredundant_permutations(seq, r=None): > > last = () > > for perm in permutations(sorted(seq), r): > > if perm > last: > > yield perm > > last = perm > > > > I can't look only for inequality, but must use the actual comparison. > > > > >>> ["".join(x) for x in nonredundant_permutations('aaabb',3)] > > ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] > > >>> list(nonredundant_permutations([F(3,1), D(3.0), 3.0])) > > [(Fraction(3, 1), Decimal('3'), 3.0)] > > > > Of course, this approach DOES rely on the order in which > itertools.permutations() returns values. However, it's a bit more > compact than MRAB's version. > > As there is no requirement that entries in a sequence handled by > itertools.permutations be sortable, so the original question of why this > isn't done by default has been answered (the general solution risks > consuming too much memory, while the memory efficient solution > constrains the domain to only sortable sequences). > OK, here's a new implementation: def unique_permutations(iterable, count=None): def perm(items, count): if count: prev_item = object() for i, item in enumerate(items): if item != prev_item: for p in perm(items[ : i] + items[i + 1 : ], count - 1): yield [item] + p prev_item = item else: yield [] items = list(iterable) keys = {} for item in items: keys.setdefault(item, len(keys)) items.sort(key=keys.get) if count is None: count = len(items) yield from perm(items, count) From steve at pearwood.info Sat Oct 12 04:06:48 2013 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 12 Oct 2013 13:06:48 +1100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> Message-ID: <20131012020647.GH7989@ando> On Fri, Oct 11, 2013 at 11:38:33AM -0700, Neil Girdhar wrote: > "It is universally agreed that a list of n distinct symbols has n! > permutations. However, when the symbols are not distinct, the most common > convention, in mathematics and elsewhere, seems to be to count only > distinct permutations." ? I dispute this entire premise. Take a simple (and stereotypical) example, picking balls from an urn. Say that you have three Red and Two black balls, and randomly select without replacement. If you count only unique permutations, you get only four possibilities: py> set(''.join(t) for t in itertools.permutations('RRRBB', 2)) {'BR', 'RB', 'RR', 'BB'} which implies that drawing RR is no more likely than drawing BB, which is incorrect. 
The right way to model this experiment is not to count distinct permutations, but actual permutations: py> list(''.join(t) for t in itertools.permutations('RRRBB', 2)) ['RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', 'BR', 'BR', 'BR', 'BB', 'BR', 'BR', 'BR', 'BB'] which makes it clear that there are two ways of drawing BB compared to six ways of drawing RR. If that's not obvious enough, consider the case where you have two thousand red balls and two black balls -- do you really conclude that there are the same number of ways to pick RR as BB? So I disagree that counting only distinct permutations is the most useful or common convention. If you're permuting a collection of non-distinct values, you should expect non-distinct permutations. I'm trying to think of a realistic, physical situation where you would only want distinct permutations, and I can't. > Should we consider fixing itertools.permutations and to output only unique > permutations (if possible, although I realize that would break code). Absolutely not. Even if you were right that it should return unique permutations, and I strongly disagree that you were, the fact that it would break code is a deal-breaker. -- Steven From python at mrabarnett.plus.com Sat Oct 12 04:34:33 2013 From: python at mrabarnett.plus.com (MRAB) Date: Sat, 12 Oct 2013 03:34:33 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: <5258AC0B.1090603@mrabarnett.plus.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> Message-ID: <5258B539.10307@mrabarnett.plus.com> On 12/10/2013 02:55, MRAB wrote: > On 12/10/2013 00:49, Nick Coghlan wrote: >> >> On 12 Oct 2013 08:45, "David Mertz" > > wrote: >> > >> > >> > I realize after reading >> http://stackoverflow.com/questions/6284396/permutations-with-unique-values >> that my version was ALMOST right: >> > >> > def nonredundant_permutations(seq, r=None): >> > last = () >> > for perm in permutations(sorted(seq), r): >> > if perm > last: >> > yield perm >> > last = perm >> > >> > I can't look only for inequality, but must use the actual comparison. >> > >> > >>> ["".join(x) for x in nonredundant_permutations('aaabb',3)] >> > ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] >> > >>> list(nonredundant_permutations([F(3,1), D(3.0), 3.0])) >> > [(Fraction(3, 1), Decimal('3'), 3.0)] >> > >> > Of course, this approach DOES rely on the order in which >> itertools.permutations() returns values. However, it's a bit more >> compact than MRAB's version. >> >> As there is no requirement that entries in a sequence handled by >> itertools.permutations be sortable, so the original question of why this >> isn't done by default has been answered (the general solution risks >> consuming too much memory, while the memory efficient solution >> constrains the domain to only sortable sequences). >> > OK, here's a new implementation: > [snip] I've just realised that I don't need to sort them at all. 
Here's a new improved implementation: def unique_permutations(iterable, count=None): def perm(items, count): if count: seen = set() for i, item in enumerate(items): if item not in seen: for p in perm(items[ : i] + items[i + 1 : ], count - 1): yield [item] + p seen.add(item) else: yield [] items = list(iterable) if count is None: count = len(items) yield from perm(items, count) From mertz at gnosis.cx Sat Oct 12 04:36:26 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 19:36:26 -0700 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: <5258AC0B.1090603@mrabarnett.plus.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> Message-ID: Hi MRAB, I'm confused by your implementation. In particular, what do these lines do? # [...] items = list(iterable) keys = {} for item in items: keys.setdefault(item, len(keys)) items.sort(key=keys.get) I cannot understand how these can possibly have any effect (other than the first line that makes a concrete list out of an iterable). We loop through the list in its natural order. E.g. say the list is '[a, b, c]' (where those names are any types of objects whatsoever). The loop gives us: keys == {a:0, b:1, c:2} When we do a sort on 'key=keys.get()' how can that ever possibly change the order of 'items'? There's also a bit of a flaw in that your implementation blows up if anything yielded by iterable isn't hashable: >>> list(unique_permutations([ [1,2],[3,4],[5,6] ])) Traceback (most recent call last): File "", line 1, in TypeError: unhashable type: 'list' There's no problem doing this with itertools.permutations: >>> list(permutations([[1,2],[3,4],[5,6]])) [([1, 2], [3, 4], [5, 6]), ([1, 2], [5, 6], [3, 4]), ([3, 4], [1, 2], [5, 6]), ([3, 4], [5, 6], [1, 2]), ([5, 6], [1, 2], [3, 4]), ([5, 6], [3, 4], [1, 2])] This particular one also succeeds with my nonredundant_permutations: >>> list(nonredundant_permutations([[1,2],[3,4],[5,6]])) [([1, 2], [3, 4], [5, 6]), ([1, 2], [5, 6], [3, 4]), ([3, 4], [1, 2], [5, 6]), ([3, 4], [5, 6], [1, 2]), ([5, 6], [1, 2], [3, 4]), ([5, 6], [3, 4], [1, 2])] However, my version *DOES* fail when things cannot be compared under inequality: >>> list(nonredundant_permutations([[1,2],3,4])) Traceback (most recent call last): File "", line 1, in File "", line 3, in nonredundant_permutations TypeError: unorderable types: int() < list() This also doesn't afflict itertools.permutations. Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Sat Oct 12 04:37:02 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Fri, 11 Oct 2013 22:37:02 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <20131012020647.GH7989@ando> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> Message-ID: I think it's pretty indisputable that permutations are formally defined this way (and I challenge you to find a source that doesn't agree with that). 
I'm sure you know that your idea of using permutations to evaluate a multinomial distribution is not efficient. A nicer way to evaluate probabilities is to pass your set through a collections.Counter, and then use the resulting dictionary with scipy.stats.multinomial (if it exists yet). I believe most people will be surprised that len(permutations(iterable)) does count unique permutations. Best, Neil On Fri, Oct 11, 2013 at 10:06 PM, Steven D'Aprano wrote: > On Fri, Oct 11, 2013 at 11:38:33AM -0700, Neil Girdhar wrote: > > "It is universally agreed that a list of n distinct symbols has n! > > permutations. However, when the symbols are not distinct, the most common > > convention, in mathematics and elsewhere, seems to be to count only > > distinct permutations." ? > > I dispute this entire premise. Take a simple (and stereotypical) > example, picking balls from an urn. > > Say that you have three Red and Two black balls, and randomly select > without replacement. If you count only unique permutations, you get only > four possibilities: > > py> set(''.join(t) for t in itertools.permutations('RRRBB', 2)) > {'BR', 'RB', 'RR', 'BB'} > > which implies that drawing RR is no more likely than drawing BB, which > is incorrect. The right way to model this experiment is not to count > distinct permutations, but actual permutations: > > py> list(''.join(t) for t in itertools.permutations('RRRBB', 2)) > ['RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', > 'BR', 'BR', 'BR', 'BB', 'BR', 'BR', 'BR', 'BB'] > > which makes it clear that there are two ways of drawing BB compared to > six ways of drawing RR. If that's not obvious enough, consider the case > where you have two thousand red balls and two black balls -- do you > really conclude that there are the same number of ways to pick RR as BB? > > So I disagree that counting only distinct permutations is the most > useful or common convention. If you're permuting a collection of > non-distinct values, you should expect non-distinct permutations. > > I'm trying to think of a realistic, physical situation where you would > only want distinct permutations, and I can't. > > > > Should we consider fixing itertools.permutations and to output only > unique > > permutations (if possible, although I realize that would break code). > > Absolutely not. Even if you were right that it should return unique > permutations, and I strongly disagree that you were, the fact that it > would break code is a deal-breaker. > > > > -- > Steven > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- > > --- > You received this message because you are subscribed to a topic in the > Google Groups "python-ideas" group. > To unsubscribe from this topic, visit > https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. > To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe at googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mertz at gnosis.cx Sat Oct 12 04:48:23 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 19:48:23 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> Message-ID: Related to, but not quite the same as Steven D'Aprano's point, I would find it very strange for itertools.permutations() to return a list that was narrowed to equal-but-not-identical items. This is why I've raised the example of 'items=[Fraction(3,1), Decimal(3.0), 3.0]' several times. I've created the Fraction, Decimal, and float for distinct reasons to get different behaviors and available methods. When I want to look for the permutations of those I don't want "any old random choice of equal values" since presumably I've given them a type for a reason. On the other hand, I can see a little bit of sense that 'itertools.permutations([3,3,3,3,3,3,3])' doesn't *really* need to tell me a list of 7!==5040 things that are exactly the same as each other. On the other hand, I don't know how to generalize that, since my feeling is far less clear for 'itertools.permutations([1,2,3,4,5,6,6])' ... there's redundancy, but there's also important information in the probability and count of specific sequences. My feeling, however, is that if one were to trim down the results from a permutations-related function, it is more interesting to me to only eliminate IDENTICAL items, not to eliminate merely EQUAL ones. On Fri, Oct 11, 2013 at 7:37 PM, Neil Girdhar wrote: > I think it's pretty indisputable that permutations are formally defined > this way (and I challenge you to find a source that doesn't agree with > that). I'm sure you know that your idea of using permutations to evaluate > a multinomial distribution is not efficient. A nicer way to evaluate > probabilities is to pass your set through a collections.Counter, and then > use the resulting dictionary with scipy.stats.multinomial (if it exists > yet). > > I believe most people will be surprised that len(permutations(iterable)) > does count unique permutations. > > Best, > > Neil > > > On Fri, Oct 11, 2013 at 10:06 PM, Steven D'Aprano wrote: > >> On Fri, Oct 11, 2013 at 11:38:33AM -0700, Neil Girdhar wrote: >> > "It is universally agreed that a list of n distinct symbols has n! >> > permutations. However, when the symbols are not distinct, the most >> common >> > convention, in mathematics and elsewhere, seems to be to count only >> > distinct permutations." ? >> >> I dispute this entire premise. Take a simple (and stereotypical) >> example, picking balls from an urn. >> >> Say that you have three Red and Two black balls, and randomly select >> without replacement. If you count only unique permutations, you get only >> four possibilities: >> >> py> set(''.join(t) for t in itertools.permutations('RRRBB', 2)) >> {'BR', 'RB', 'RR', 'BB'} >> >> which implies that drawing RR is no more likely than drawing BB, which >> is incorrect. The right way to model this experiment is not to count >> distinct permutations, but actual permutations: >> >> py> list(''.join(t) for t in itertools.permutations('RRRBB', 2)) >> ['RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', >> 'BR', 'BR', 'BR', 'BB', 'BR', 'BR', 'BR', 'BB'] >> >> which makes it clear that there are two ways of drawing BB compared to >> six ways of drawing RR. 
If that's not obvious enough, consider the case >> where you have two thousand red balls and two black balls -- do you >> really conclude that there are the same number of ways to pick RR as BB? >> >> So I disagree that counting only distinct permutations is the most >> useful or common convention. If you're permuting a collection of >> non-distinct values, you should expect non-distinct permutations. >> >> I'm trying to think of a realistic, physical situation where you would >> only want distinct permutations, and I can't. >> >> >> > Should we consider fixing itertools.permutations and to output only >> unique >> > permutations (if possible, although I realize that would break code). >> >> Absolutely not. Even if you were right that it should return unique >> permutations, and I strongly disagree that you were, the fact that it >> would break code is a deal-breaker. >> >> >> >> -- >> Steven >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> >> -- >> >> --- >> You received this message because you are subscribed to a topic in the >> Google Groups "python-ideas" group. >> To unsubscribe from this topic, visit >> https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. >> To unsubscribe from this group and all its topics, send an email to >> python-ideas+unsubscribe at googlegroups.com. >> For more options, visit https://groups.google.com/groups/opt_out. >> > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Sat Oct 12 04:55:06 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Fri, 11 Oct 2013 22:55:06 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> Message-ID: I honestly think that Python should stick to the mathematical definition of permutations rather than some kind of consensus of the tiny minority of people here. When next_permutation was added to C++, I believe the whole standards committee discussed it and they came up with the thing that makes the most sense. The fact that dict and set use equality is I think the reason that permutations should use equality. Neil On Fri, Oct 11, 2013 at 10:48 PM, David Mertz wrote: > Related to, but not quite the same as Steven D'Aprano's point, I would > find it very strange for itertools.permutations() to return a list that was > narrowed to equal-but-not-identical items. > > This is why I've raised the example of 'items=[Fraction(3,1), > Decimal(3.0), 3.0]' several times. I've created the Fraction, Decimal, and > float for distinct reasons to get different behaviors and available > methods. When I want to look for the permutations of those I don't want > "any old random choice of equal values" since presumably I've given them a > type for a reason. 
> > On the other hand, I can see a little bit of sense that > 'itertools.permutations([3,3,3,3,3,3,3])' doesn't *really* need to tell me > a list of 7!==5040 things that are exactly the same as each other. On the > other hand, I don't know how to generalize that, since my feeling is far > less clear for 'itertools.permutations([1,2,3,4,5,6,6])' ... there's > redundancy, but there's also important information in the probability and > count of specific sequences. > > My feeling, however, is that if one were to trim down the results from a > permutations-related function, it is more interesting to me to only > eliminate IDENTICAL items, not to eliminate merely EQUAL ones. > > > On Fri, Oct 11, 2013 at 7:37 PM, Neil Girdhar wrote: > >> I think it's pretty indisputable that permutations are formally defined >> this way (and I challenge you to find a source that doesn't agree with >> that). I'm sure you know that your idea of using permutations to evaluate >> a multinomial distribution is not efficient. A nicer way to evaluate >> probabilities is to pass your set through a collections.Counter, and then >> use the resulting dictionary with scipy.stats.multinomial (if it exists >> yet). >> >> I believe most people will be surprised that len(permutations(iterable)) >> does count unique permutations. >> >> Best, >> >> Neil >> >> >> On Fri, Oct 11, 2013 at 10:06 PM, Steven D'Aprano wrote: >> >>> On Fri, Oct 11, 2013 at 11:38:33AM -0700, Neil Girdhar wrote: >>> > "It is universally agreed that a list of n distinct symbols has n! >>> > permutations. However, when the symbols are not distinct, the most >>> common >>> > convention, in mathematics and elsewhere, seems to be to count only >>> > distinct permutations." ? >>> >>> I dispute this entire premise. Take a simple (and stereotypical) >>> example, picking balls from an urn. >>> >>> Say that you have three Red and Two black balls, and randomly select >>> without replacement. If you count only unique permutations, you get only >>> four possibilities: >>> >>> py> set(''.join(t) for t in itertools.permutations('RRRBB', 2)) >>> {'BR', 'RB', 'RR', 'BB'} >>> >>> which implies that drawing RR is no more likely than drawing BB, which >>> is incorrect. The right way to model this experiment is not to count >>> distinct permutations, but actual permutations: >>> >>> py> list(''.join(t) for t in itertools.permutations('RRRBB', 2)) >>> ['RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', >>> 'BR', 'BR', 'BR', 'BB', 'BR', 'BR', 'BR', 'BB'] >>> >>> which makes it clear that there are two ways of drawing BB compared to >>> six ways of drawing RR. If that's not obvious enough, consider the case >>> where you have two thousand red balls and two black balls -- do you >>> really conclude that there are the same number of ways to pick RR as BB? >>> >>> So I disagree that counting only distinct permutations is the most >>> useful or common convention. If you're permuting a collection of >>> non-distinct values, you should expect non-distinct permutations. >>> >>> I'm trying to think of a realistic, physical situation where you would >>> only want distinct permutations, and I can't. >>> >>> >>> > Should we consider fixing itertools.permutations and to output only >>> unique >>> > permutations (if possible, although I realize that would break code). >>> >>> Absolutely not. Even if you were right that it should return unique >>> permutations, and I strongly disagree that you were, the fact that it >>> would break code is a deal-breaker. 
>>> >>> >>> >>> -- >>> Steven >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> >>> -- >>> >>> --- >>> You received this message because you are subscribed to a topic in the >>> Google Groups "python-ideas" group. >>> To unsubscribe from this topic, visit >>> https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. >>> To unsubscribe from this group and all its topics, send an email to >>> python-ideas+unsubscribe at googlegroups.com. >>> For more options, visit https://groups.google.com/groups/opt_out. >>> >> >> >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> >> > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abarnert at yahoo.com Sat Oct 12 04:57:08 2013 From: abarnert at yahoo.com (Andrew Barnert) Date: Fri, 11 Oct 2013 19:57:08 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> Message-ID: <6123349B-FFCB-42FE-B973-B3C8251302C4@yahoo.com> On Oct 11, 2013, at 19:48, David Mertz wrote: > My feeling, however, is that if one were to trim down the results from a permutations-related function, it is more interesting to me to only eliminate IDENTICAL items, not to eliminate merely EQUAL ones. I agree with the rest of your message, but I still think you're wrong here. Anyone who is surprised by distinct_permutations((3.0, 3)) treating the two values the same would be equally surprised by {3.0, 3} having only one member. Or by groupby((3.0, 'a'), (3, 'b')) only having one group. And so on. In Python, sets, dict keys, groups, etc. work by ==. That was a choice that could have been made differently, but Python made that choice long ago, and has applied it completely consistently, and it would be very strange to choose differently in this case. From ncoghlan at gmail.com Sat Oct 12 06:35:13 2013 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sat, 12 Oct 2013 14:35:13 +1000 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> Message-ID: On 12 Oct 2013 12:56, "Neil Girdhar" wrote: > > I honestly think that Python should stick to the mathematical definition of permutations rather than some kind of consensus of the tiny minority of people here. When next_permutation was added to C++, I believe the whole standards committee discussed it and they came up with the thing that makes the most sense. The fact that dict and set use equality is I think the reason that permutations should use equality. Why should the behaviour of hash based containers limit the behaviour of itertools? Python required a permutation solution that is memory efficient and works with arbitrary objects, so that's what itertools provides. 
However, you'd also like a memory efficient iterator for *mathematical* permutations that pays attention to object values and filters out equivalent results. I *believe* the request is equivalent to giving a name to the following genexp: (k for k, grp in groupby(permutations(sorted(input)))) That's a reasonable enough request (although perhaps more suited to the recipes section in the itertools docs), but conflating it with complaints about the way the existing iterator works is a good way to get people to ignore you (especially now the language specific reasons for the current behaviour have been pointed out, along with confirmation of the fact that backwards compatibility requirements would prohibit changing it even if we wanted to). Cheers, Nick. > > Neil > > > On Fri, Oct 11, 2013 at 10:48 PM, David Mertz wrote: >> >> Related to, but not quite the same as Steven D'Aprano's point, I would find it very strange for itertools.permutations() to return a list that was narrowed to equal-but-not-identical items. >> >> This is why I've raised the example of 'items=[Fraction(3,1), Decimal(3.0), 3.0]' several times. I've created the Fraction, Decimal, and float for distinct reasons to get different behaviors and available methods. When I want to look for the permutations of those I don't want "any old random choice of equal values" since presumably I've given them a type for a reason. >> >> On the other hand, I can see a little bit of sense that 'itertools.permutations([3,3,3,3,3,3,3])' doesn't *really* need to tell me a list of 7!==5040 things that are exactly the same as each other. On the other hand, I don't know how to generalize that, since my feeling is far less clear for 'itertools.permutations([1,2,3,4,5,6,6])' ... there's redundancy, but there's also important information in the probability and count of specific sequences. >> >> My feeling, however, is that if one were to trim down the results from a permutations-related function, it is more interesting to me to only eliminate IDENTICAL items, not to eliminate merely EQUAL ones. >> >> >> On Fri, Oct 11, 2013 at 7:37 PM, Neil Girdhar wrote: >>> >>> I think it's pretty indisputable that permutations are formally defined this way (and I challenge you to find a source that doesn't agree with that). I'm sure you know that your idea of using permutations to evaluate a multinomial distribution is not efficient. A nicer way to evaluate probabilities is to pass your set through a collections.Counter, and then use the resulting dictionary with scipy.stats.multinomial (if it exists yet). >>> >>> I believe most people will be surprised that len(permutations(iterable)) does count unique permutations. >>> >>> Best, >>> >>> Neil >>> >>> >>> On Fri, Oct 11, 2013 at 10:06 PM, Steven D'Aprano wrote: >>>> >>>> On Fri, Oct 11, 2013 at 11:38:33AM -0700, Neil Girdhar wrote: >>>> > "It is universally agreed that a list of n distinct symbols has n! >>>> > permutations. However, when the symbols are not distinct, the most common >>>> > convention, in mathematics and elsewhere, seems to be to count only >>>> > distinct permutations." ? >>>> >>>> I dispute this entire premise. Take a simple (and stereotypical) >>>> example, picking balls from an urn. >>>> >>>> Say that you have three Red and Two black balls, and randomly select >>>> without replacement. 
If you count only unique permutations, you get only >>>> four possibilities: >>>> >>>> py> set(''.join(t) for t in itertools.permutations('RRRBB', 2)) >>>> {'BR', 'RB', 'RR', 'BB'} >>>> >>>> which implies that drawing RR is no more likely than drawing BB, which >>>> is incorrect. The right way to model this experiment is not to count >>>> distinct permutations, but actual permutations: >>>> >>>> py> list(''.join(t) for t in itertools.permutations('RRRBB', 2)) >>>> ['RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', >>>> 'BR', 'BR', 'BR', 'BB', 'BR', 'BR', 'BR', 'BB'] >>>> >>>> which makes it clear that there are two ways of drawing BB compared to >>>> six ways of drawing RR. If that's not obvious enough, consider the case >>>> where you have two thousand red balls and two black balls -- do you >>>> really conclude that there are the same number of ways to pick RR as BB? >>>> >>>> So I disagree that counting only distinct permutations is the most >>>> useful or common convention. If you're permuting a collection of >>>> non-distinct values, you should expect non-distinct permutations. >>>> >>>> I'm trying to think of a realistic, physical situation where you would >>>> only want distinct permutations, and I can't. >>>> >>>> >>>> > Should we consider fixing itertools.permutations and to output only unique >>>> > permutations (if possible, although I realize that would break code). >>>> >>>> Absolutely not. Even if you were right that it should return unique >>>> permutations, and I strongly disagree that you were, the fact that it >>>> would break code is a deal-breaker. >>>> >>>> >>>> >>>> -- >>>> Steven >>>> _______________________________________________ >>>> Python-ideas mailing list >>>> Python-ideas at python.org >>>> https://mail.python.org/mailman/listinfo/python-ideas >>>> >>>> -- >>>> >>>> --- >>>> You received this message because you are subscribed to a topic in the Google Groups "python-ideas" group. >>>> To unsubscribe from this topic, visit https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. >>>> To unsubscribe from this group and all its topics, send an email to python-ideas+unsubscribe at googlegroups.com. >>>> For more options, visit https://groups.google.com/groups/opt_out. >>> >>> >>> >>> _______________________________________________ >>> Python-ideas mailing list >>> Python-ideas at python.org >>> https://mail.python.org/mailman/listinfo/python-ideas >>> >> >> >> >> -- >> Keeping medicines from the bloodstreams of the sick; food >> from the bellies of the hungry; books from the hands of the >> uneducated; technology from the underdeveloped; and putting >> advocates of freedom in prisons. Intellectual property is >> to the 21st century what the slave trade was to the 16th. > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stephen at xemacs.org Sat Oct 12 07:10:21 2013 From: stephen at xemacs.org (Stephen J. 
Turnbull)
Date: Sat, 12 Oct 2013 14:10:21 +0900
Subject: [Python-ideas] Extremely weird itertools.permutations
In-Reply-To:
References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com>
	<20131012020647.GH7989@ando>
Message-ID: <87pprbgjlu.fsf@uwakimon.sk.tsukuba.ac.jp>

Neil Girdhar writes:

> I honestly think that Python should stick to the mathematical
> definition of permutations rather than some kind of consensus of
> the tiny minority of people here.

Is there an agreed mathematical definition of permutations of *sequences*? Every definition I can find refers to permutations of *sets*.

I think any categorist would agree that there are a large number of maps of _Sequence_ to _Set_, in particular the two obviously useful ones[1]: the one that takes each element of the sequence to a *different* element of the corresponding set, and the one that takes equal elements of the sequence to the *same* element of the corresponding set. The corresponding set need not be the underlying set of the sequence, and which one is appropriate presumably depends on applications.

> When next_permutation was added to C++, I believe the whole
> standards committee discussed it and they came up with the thing
> that makes the most sense.

To the negligible (in several senses of the word) fraction of humanity that participates in C++ standardization. Python is not C++ (thanking all the Roman and Greek gods, and refusing to identify Zeus with Jupiter, nor Aphrodite with Venus).

> The fact that dict and set use equality is I think the reason that
> permutations should use equality.

Sequences are not sets, and dict is precisely the wrong example for you to use, since it makes exactly the point that values that are identical may be bound to several different keys. We don't unify keys in a dict just because the values are identical (or equal). Similarly, in representing a sequence as a set, we use a set of ordered pairs, with the first component a unique integer indicating position, and the second the sequence element.

Since there are several useful mathematical ways to convert sequences to sets, and in particular one very similar, if not identical, to the one you like is enshrined in the very convenient constructor set(), I think it's useful to leave it as it is.

> It is universally agreed that a list of n distinct symbols has n!
> permutations.

But that's because there's really no sensible definition of "underlying set" for such a list except the set containing exactly the same elements as the list.[2] But there is no universal agreement that "permutations of a list" is a sensible phrase. For example, although the Wikipedia article Permutation refers to lists of permutations, linked list representations of data, to the "list of objects" for use in Cauchy's notation, and to the cycle representation as a list of sequences, it doesn't once refer to permutation of a list. They're obviously not averse to discussing lists, but the word used for the entity being permuted is invariably "set".

Footnotes:
[1] And some maps not terribly useful for our purposes, such as one that
    maps all sequences to a singleton.

[2] A categorist would disagree, but that's not interesting.
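[Sketch for reference, not a message from the thread: the "nonredundant_permutations" approach David compares below -- sort the input, then keep only tuples strictly greater than the last one yielded -- can be written roughly as follows. The name is illustrative, and the elements must be orderable with <.]

from itertools import permutations

def distinct_permutations(iterable, r=None):
    # Keep a permutation tuple only when it is strictly greater than the
    # previous tuple kept; with a sorted input this retains exactly one
    # copy of each distinct permutation, as discussed in the messages below.
    last = None
    for p in permutations(sorted(iterable), r):
        if last is None or p > last:
            last = p
            yield p

# e.g. [''.join(p) for p in distinct_permutations('aaabb', 3)]
# -> ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba']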
From mertz at gnosis.cx Sat Oct 12 07:26:07 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 22:26:07 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> Message-ID: What you propose, Nick, is definitely different from the several functions that have been bandied about here. I.e. >>> def nick_permutations(items, r=None): ... return (k for k, grp in groupby(permutations(sorted(items),r))) >>> ["".join(p) for p in nonredundant_permutations('aaabb', 3)] ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba'] >>> ["".join(p) for p in nick_permutations('aaabb', 3)] ['aaa', 'aab', 'aaa', 'aab', 'aba', 'abb', 'aba', 'abb', 'aaa', 'aab', 'aaa', 'aab', 'aba', 'abb', 'aba', 'abb', 'aaa', 'aab', 'aaa', 'aab', 'aba', 'abb', 'aba', 'abb', 'baa', 'bab', 'baa', 'bab', 'baa', 'bab', 'bba', 'baa', 'bab', 'baa', 'bab', 'baa', 'bab', 'bba'] >>> ["".join(p) for p in permutations('aaabb', 3)] ['aaa', 'aab', 'aab', 'aaa', 'aab', 'aab', 'aba', 'aba', 'abb', 'aba', 'aba', 'abb', 'aaa', 'aab', 'aab', 'aaa', 'aab', 'aab', 'aba', 'aba', 'abb', 'aba', 'aba', 'abb', 'aaa', 'aab', 'aab', 'aaa', 'aab', 'aab', 'aba', 'aba', 'abb', 'aba', 'aba', 'abb', 'baa', 'baa', 'bab', 'baa', 'baa', 'bab', 'baa', 'baa', 'bab', 'bba', 'bba', 'bba', 'baa', 'baa', 'bab', 'baa', 'baa', 'bab', 'baa', 'baa', 'bab', 'bba', 'bba', 'bba'] If I'm thinking of this right, what you give is equivalent to the initial flawed version of 'nonredundant_permutations()' that I suggested, which used '!=' rather than the correct '>' in comparing to the 'last' tuple. FWIW, I deliberately chose the name 'nonredundant_permutations' rather than MRAB's choice of 'unique_permutations' because I think what the filtering does is precisely NOT to give unique ones. Or rather, not to give ALL unique ones, but only those defined by equivalence (i.e. rather than identity). My name is ugly, and if there were to be a function like it in itertools, a better name should be found. But such a name should emphasize that it is "filter by equivalence classes" ... actually, maybe this suggests another function which is instead "filter by identity of tuples", potentially also added to itertools. On Fri, Oct 11, 2013 at 9:35 PM, Nick Coghlan wrote: > > On 12 Oct 2013 12:56, "Neil Girdhar" wrote: > > > > I honestly think that Python should stick to the mathematical definition > of permutations rather than some kind of consensus of the tiny minority of > people here. When next_permutation was added to C++, I believe the whole > standards committee discussed it and they came up with the thing that makes > the most sense. The fact that dict and set use equality is I think the > reason that permutations should use equality. > > Why should the behaviour of hash based containers limit the behaviour of > itertools? > > Python required a permutation solution that is memory efficient and works > with arbitrary objects, so that's what itertools provides. > > However, you'd also like a memory efficient iterator for *mathematical* > permutations that pays attention to object values and filters out > equivalent results. 
> > I *believe* the request is equivalent to giving a name to the following > genexp: > > (k for k, grp in groupby(permutations(sorted(input)))) > > That's a reasonable enough request (although perhaps more suited to the > recipes section in the itertools docs), but conflating it with complaints > about the way the existing iterator works is a good way to get people to > ignore you (especially now the language specific reasons for the current > behaviour have been pointed out, along with confirmation of the fact that > backwards compatibility requirements would prohibit changing it even if we > wanted to). > > Cheers, > Nick. > > > > > Neil > > > > > > On Fri, Oct 11, 2013 at 10:48 PM, David Mertz wrote: > >> > >> Related to, but not quite the same as Steven D'Aprano's point, I would > find it very strange for itertools.permutations() to return a list that was > narrowed to equal-but-not-identical items. > >> > >> This is why I've raised the example of 'items=[Fraction(3,1), > Decimal(3.0), 3.0]' several times. I've created the Fraction, Decimal, and > float for distinct reasons to get different behaviors and available > methods. When I want to look for the permutations of those I don't want > "any old random choice of equal values" since presumably I've given them a > type for a reason. > >> > >> On the other hand, I can see a little bit of sense that > 'itertools.permutations([3,3,3,3,3,3,3])' doesn't *really* need to tell me > a list of 7!==5040 things that are exactly the same as each other. On the > other hand, I don't know how to generalize that, since my feeling is far > less clear for 'itertools.permutations([1,2,3,4,5,6,6])' ... there's > redundancy, but there's also important information in the probability and > count of specific sequences. > >> > >> My feeling, however, is that if one were to trim down the results from > a permutations-related function, it is more interesting to me to only > eliminate IDENTICAL items, not to eliminate merely EQUAL ones. > >> > >> > >> On Fri, Oct 11, 2013 at 7:37 PM, Neil Girdhar > wrote: > >>> > >>> I think it's pretty indisputable that permutations are formally > defined this way (and I challenge you to find a source that doesn't agree > with that). I'm sure you know that your idea of using permutations to > evaluate a multinomial distribution is not efficient. A nicer way to > evaluate probabilities is to pass your set through a collections.Counter, > and then use the resulting dictionary with scipy.stats.multinomial (if it > exists yet). > >>> > >>> I believe most people will be surprised that > len(permutations(iterable)) does count unique permutations. > >>> > >>> Best, > >>> > >>> Neil > >>> > >>> > >>> On Fri, Oct 11, 2013 at 10:06 PM, Steven D'Aprano > wrote: > >>>> > >>>> On Fri, Oct 11, 2013 at 11:38:33AM -0700, Neil Girdhar wrote: > >>>> > "It is universally agreed that a list of n distinct symbols has n! > >>>> > permutations. However, when the symbols are not distinct, the most > common > >>>> > convention, in mathematics and elsewhere, seems to be to count only > >>>> > distinct permutations." ? > >>>> > >>>> I dispute this entire premise. Take a simple (and stereotypical) > >>>> example, picking balls from an urn. > >>>> > >>>> Say that you have three Red and Two black balls, and randomly select > >>>> without replacement. 
If you count only unique permutations, you get > only > >>>> four possibilities: > >>>> > >>>> py> set(''.join(t) for t in itertools.permutations('RRRBB', 2)) > >>>> {'BR', 'RB', 'RR', 'BB'} > >>>> > >>>> which implies that drawing RR is no more likely than drawing BB, which > >>>> is incorrect. The right way to model this experiment is not to count > >>>> distinct permutations, but actual permutations: > >>>> > >>>> py> list(''.join(t) for t in itertools.permutations('RRRBB', 2)) > >>>> ['RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', > 'RB', > >>>> 'BR', 'BR', 'BR', 'BB', 'BR', 'BR', 'BR', 'BB'] > >>>> > >>>> which makes it clear that there are two ways of drawing BB compared to > >>>> six ways of drawing RR. If that's not obvious enough, consider the > case > >>>> where you have two thousand red balls and two black balls -- do you > >>>> really conclude that there are the same number of ways to pick RR as > BB? > >>>> > >>>> So I disagree that counting only distinct permutations is the most > >>>> useful or common convention. If you're permuting a collection of > >>>> non-distinct values, you should expect non-distinct permutations. > >>>> > >>>> I'm trying to think of a realistic, physical situation where you would > >>>> only want distinct permutations, and I can't. > >>>> > >>>> > >>>> > Should we consider fixing itertools.permutations and to output only > unique > >>>> > permutations (if possible, although I realize that would break > code). > >>>> > >>>> Absolutely not. Even if you were right that it should return unique > >>>> permutations, and I strongly disagree that you were, the fact that it > >>>> would break code is a deal-breaker. > >>>> > >>>> > >>>> > >>>> -- > >>>> Steven > >>>> _______________________________________________ > >>>> Python-ideas mailing list > >>>> Python-ideas at python.org > >>>> https://mail.python.org/mailman/listinfo/python-ideas > >>>> > >>>> -- > >>>> > >>>> --- > >>>> You received this message because you are subscribed to a topic in > the Google Groups "python-ideas" group. > >>>> To unsubscribe from this topic, visit > https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. > >>>> To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe at googlegroups.com. > >>>> For more options, visit https://groups.google.com/groups/opt_out. > >>> > >>> > >>> > >>> _______________________________________________ > >>> Python-ideas mailing list > >>> Python-ideas at python.org > >>> https://mail.python.org/mailman/listinfo/python-ideas > >>> > >> > >> > >> > >> -- > >> Keeping medicines from the bloodstreams of the sick; food > >> from the bellies of the hungry; books from the hands of the > >> uneducated; technology from the underdeveloped; and putting > >> advocates of freedom in prisons. Intellectual property is > >> to the 21st century what the slave trade was to the 16th. > > > > > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mertz at gnosis.cx Sat Oct 12 07:38:19 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 22:38:19 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <6123349B-FFCB-42FE-B973-B3C8251302C4@yahoo.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <6123349B-FFCB-42FE-B973-B3C8251302C4@yahoo.com> Message-ID: Hi Andrew, I've sort of said as much in my last reply to Nick. But maybe I can clarify further. I can imagine *someone* wanting a filtering of permutations by either identify or equality. Maybe, in fact, by other comparisons also for generality. This might suggest an API like the following: equal_perms = distinct_permutations(items, r, filter_by=operator.eq) ident_perms = distinct_permutations(items, r, filter_by=operator.is_) Or even perhaps, in some use-case that isn't clear to me, e.g. start_same_perms = distinct_permutations(items, r, filter_by=lambda a,b: a[0]==b[0]) Or perhaps more plausibly, some predicate that, e.g. tests if two returned tuples are the same under case normalization of the strings within them. I guess the argument then would be what the default value of 'filter_by' might be... but that seems less important to me if there were an option to pass a predicate as you liked. On Fri, Oct 11, 2013 at 7:57 PM, Andrew Barnert wrote: > On Oct 11, 2013, at 19:48, David Mertz wrote: > > > My feeling, however, is that if one were to trim down the results from a > permutations-related function, it is more interesting to me to only > eliminate IDENTICAL items, not to eliminate merely EQUAL ones. > > I agree with the rest of your message, but I still think you're wrong > here. Anyone who is surprised by distinct_permutations((3.0, 3)) treating > the two values the same would be equally surprised by {3.0, 3} having only > one member. Or by groupby((3.0, 'a'), (3, 'b')) only having one group. And > so on. > > In Python, sets, dict keys, groups, etc. work by ==. That was a choice > that could have been made differently, but Python made that choice long > ago, and has applied it completely consistently, and it would be very > strange to choose differently in this case. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mertz at gnosis.cx Sat Oct 12 07:48:25 2013 From: mertz at gnosis.cx (David Mertz) Date: Fri, 11 Oct 2013 22:48:25 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <6123349B-FFCB-42FE-B973-B3C8251302C4@yahoo.com> Message-ID: Btw. My implementation of nonredundant_permutations *IS* guaranteed to work by the docs for Python 3.4. Actually for Python 2.7+. That is, it's not just an implementation accident (as I thought before I checked), but a promised API of itertools.permutations that: Permutations are emitted in lexicographic sort order. So, if the input iterable is sorted, the permutation tuples will be produced in sorted order. 
As long as that holds, my function will indeed behave correctly (but of course, with the limitation that it blows up if different items in the argument iterable cannot be compared using operator.lt(). On Fri, Oct 11, 2013 at 10:38 PM, David Mertz wrote: > Hi Andrew, > > I've sort of said as much in my last reply to Nick. But maybe I can > clarify further. I can imagine *someone* wanting a filtering of > permutations by either identify or equality. Maybe, in fact, by other > comparisons also for generality. > > This might suggest an API like the following: > > equal_perms = distinct_permutations(items, r, filter_by=operator.eq) > ident_perms = distinct_permutations(items, r, filter_by=operator.is_) > > Or even perhaps, in some use-case that isn't clear to me, e.g. > > start_same_perms = distinct_permutations(items, r, filter_by=lambda a,b: > a[0]==b[0]) > > Or perhaps more plausibly, some predicate that, e.g. tests if two returned > tuples are the same under case normalization of the strings within them. > > I guess the argument then would be what the default value of 'filter_by' > might be... but that seems less important to me if there were an option to > pass a predicate as you liked. > > > > On Fri, Oct 11, 2013 at 7:57 PM, Andrew Barnert wrote: > >> On Oct 11, 2013, at 19:48, David Mertz wrote: >> >> > My feeling, however, is that if one were to trim down the results from >> a permutations-related function, it is more interesting to me to only >> eliminate IDENTICAL items, not to eliminate merely EQUAL ones. >> >> I agree with the rest of your message, but I still think you're wrong >> here. Anyone who is surprised by distinct_permutations((3.0, 3)) treating >> the two values the same would be equally surprised by {3.0, 3} having only >> one member. Or by groupby((3.0, 'a'), (3, 'b')) only having one group. And >> so on. >> >> In Python, sets, dict keys, groups, etc. work by ==. That was a choice >> that could have been made differently, but Python made that choice long >> ago, and has applied it completely consistently, and it would be very >> strange to choose differently in this case. > > > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Sat Oct 12 08:34:46 2013 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 12 Oct 2013 17:34:46 +1100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> Message-ID: <20131012063445.GI7989@ando> On Fri, Oct 11, 2013 at 10:55:06PM -0400, Neil Girdhar wrote: > I honestly think that Python should stick to the mathematical definition of > permutations rather than some kind of consensus of the tiny minority of > people here. So do I. And that is exactly what itertools.permutations already does. 
-- Steven From mistersheik at gmail.com Sat Oct 12 08:55:25 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Sat, 12 Oct 2013 02:55:25 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> Message-ID: Hi Nick, Rereading my messages, I feel like I haven't been as diplomatic as I wanted. Like everyone here, I care a lot about Python and I want to see it become as perfect as it can be made. If my wording has been too strong, it's only out of passion for Python. I acknowledged in my initial request that it would be impossible to change the default behaviour of itertools.permutations. I understand that that ship has sailed. I think my best proposal is to have an efficient distinct_permutations function in itertools. It should be in itertools so that it is discoverable. It should be a function rather one of the recipes proposed to make it as efficient as possible. (Correct me if I'm wrong, but like the set solution, groupby is also not so efficient.) I welcome the discussion and hope that the most efficient implementation someone here comes up with will be added one day to itertools. Best, Neil On Sat, Oct 12, 2013 at 12:35 AM, Nick Coghlan wrote: > > On 12 Oct 2013 12:56, "Neil Girdhar" wrote: > > > > I honestly think that Python should stick to the mathematical definition > of permutations rather than some kind of consensus of the tiny minority of > people here. When next_permutation was added to C++, I believe the whole > standards committee discussed it and they came up with the thing that makes > the most sense. The fact that dict and set use equality is I think the > reason that permutations should use equality. > > Why should the behaviour of hash based containers limit the behaviour of > itertools? > > Python required a permutation solution that is memory efficient and works > with arbitrary objects, so that's what itertools provides. > > However, you'd also like a memory efficient iterator for *mathematical* > permutations that pays attention to object values and filters out > equivalent results. > > I *believe* the request is equivalent to giving a name to the following > genexp: > > (k for k, grp in groupby(permutations(sorted(input)))) > > That's a reasonable enough request (although perhaps more suited to the > recipes section in the itertools docs), but conflating it with complaints > about the way the existing iterator works is a good way to get people to > ignore you (especially now the language specific reasons for the current > behaviour have been pointed out, along with confirmation of the fact that > backwards compatibility requirements would prohibit changing it even if we > wanted to). > > Cheers, > Nick. > > > > > Neil > > > > > > On Fri, Oct 11, 2013 at 10:48 PM, David Mertz wrote: > >> > >> Related to, but not quite the same as Steven D'Aprano's point, I would > find it very strange for itertools.permutations() to return a list that was > narrowed to equal-but-not-identical items. > >> > >> This is why I've raised the example of 'items=[Fraction(3,1), > Decimal(3.0), 3.0]' several times. I've created the Fraction, Decimal, and > float for distinct reasons to get different behaviors and available > methods. When I want to look for the permutations of those I don't want > "any old random choice of equal values" since presumably I've given them a > type for a reason. 
> >> > >> On the other hand, I can see a little bit of sense that > 'itertools.permutations([3,3,3,3,3,3,3])' doesn't *really* need to tell me > a list of 7!==5040 things that are exactly the same as each other. On the > other hand, I don't know how to generalize that, since my feeling is far > less clear for 'itertools.permutations([1,2,3,4,5,6,6])' ... there's > redundancy, but there's also important information in the probability and > count of specific sequences. > >> > >> My feeling, however, is that if one were to trim down the results from > a permutations-related function, it is more interesting to me to only > eliminate IDENTICAL items, not to eliminate merely EQUAL ones. > >> > >> > >> On Fri, Oct 11, 2013 at 7:37 PM, Neil Girdhar > wrote: > >>> > >>> I think it's pretty indisputable that permutations are formally > defined this way (and I challenge you to find a source that doesn't agree > with that). I'm sure you know that your idea of using permutations to > evaluate a multinomial distribution is not efficient. A nicer way to > evaluate probabilities is to pass your set through a collections.Counter, > and then use the resulting dictionary with scipy.stats.multinomial (if it > exists yet). > >>> > >>> I believe most people will be surprised that > len(permutations(iterable)) does count unique permutations. > >>> > >>> Best, > >>> > >>> Neil > >>> > >>> > >>> On Fri, Oct 11, 2013 at 10:06 PM, Steven D'Aprano > wrote: > >>>> > >>>> On Fri, Oct 11, 2013 at 11:38:33AM -0700, Neil Girdhar wrote: > >>>> > "It is universally agreed that a list of n distinct symbols has n! > >>>> > permutations. However, when the symbols are not distinct, the most > common > >>>> > convention, in mathematics and elsewhere, seems to be to count only > >>>> > distinct permutations." ? > >>>> > >>>> I dispute this entire premise. Take a simple (and stereotypical) > >>>> example, picking balls from an urn. > >>>> > >>>> Say that you have three Red and Two black balls, and randomly select > >>>> without replacement. If you count only unique permutations, you get > only > >>>> four possibilities: > >>>> > >>>> py> set(''.join(t) for t in itertools.permutations('RRRBB', 2)) > >>>> {'BR', 'RB', 'RR', 'BB'} > >>>> > >>>> which implies that drawing RR is no more likely than drawing BB, which > >>>> is incorrect. The right way to model this experiment is not to count > >>>> distinct permutations, but actual permutations: > >>>> > >>>> py> list(''.join(t) for t in itertools.permutations('RRRBB', 2)) > >>>> ['RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', > 'RB', > >>>> 'BR', 'BR', 'BR', 'BB', 'BR', 'BR', 'BR', 'BB'] > >>>> > >>>> which makes it clear that there are two ways of drawing BB compared to > >>>> six ways of drawing RR. If that's not obvious enough, consider the > case > >>>> where you have two thousand red balls and two black balls -- do you > >>>> really conclude that there are the same number of ways to pick RR as > BB? > >>>> > >>>> So I disagree that counting only distinct permutations is the most > >>>> useful or common convention. If you're permuting a collection of > >>>> non-distinct values, you should expect non-distinct permutations. > >>>> > >>>> I'm trying to think of a realistic, physical situation where you would > >>>> only want distinct permutations, and I can't. > >>>> > >>>> > >>>> > Should we consider fixing itertools.permutations and to output only > unique > >>>> > permutations (if possible, although I realize that would break > code). 
> >>>> > >>>> Absolutely not. Even if you were right that it should return unique > >>>> permutations, and I strongly disagree that you were, the fact that it > >>>> would break code is a deal-breaker. > >>>> > >>>> > >>>> > >>>> -- > >>>> Steven > >>>> _______________________________________________ > >>>> Python-ideas mailing list > >>>> Python-ideas at python.org > >>>> https://mail.python.org/mailman/listinfo/python-ideas > >>>> > >>>> -- > >>>> > >>>> --- > >>>> You received this message because you are subscribed to a topic in > the Google Groups "python-ideas" group. > >>>> To unsubscribe from this topic, visit > https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. > >>>> To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe at googlegroups.com. > >>>> For more options, visit https://groups.google.com/groups/opt_out. > >>> > >>> > >>> > >>> _______________________________________________ > >>> Python-ideas mailing list > >>> Python-ideas at python.org > >>> https://mail.python.org/mailman/listinfo/python-ideas > >>> > >> > >> > >> > >> -- > >> Keeping medicines from the bloodstreams of the sick; food > >> from the bellies of the hungry; books from the hands of the > >> uneducated; technology from the underdeveloped; and putting > >> advocates of freedom in prisons. Intellectual property is > >> to the 21st century what the slave trade was to the 16th. > > > > > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Sat Oct 12 09:02:47 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Sat, 12 Oct 2013 03:02:47 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <6123349B-FFCB-42FE-B973-B3C8251302C4@yahoo.com> Message-ID: Why not just use the standard python way to generalize this: "key" rather than the nonstandard "filter_by". On Sat, Oct 12, 2013 at 1:38 AM, David Mertz wrote: > Hi Andrew, > > I've sort of said as much in my last reply to Nick. But maybe I can > clarify further. I can imagine *someone* wanting a filtering of > permutations by either identify or equality. Maybe, in fact, by other > comparisons also for generality. > > This might suggest an API like the following: > > equal_perms = distinct_permutations(items, r, filter_by=operator.eq) > ident_perms = distinct_permutations(items, r, filter_by=operator.is_) > > Or even perhaps, in some use-case that isn't clear to me, e.g. > > start_same_perms = distinct_permutations(items, r, filter_by=lambda a,b: > a[0]==b[0]) > > Or perhaps more plausibly, some predicate that, e.g. tests if two returned > tuples are the same under case normalization of the strings within them. > > I guess the argument then would be what the default value of 'filter_by' > might be... but that seems less important to me if there were an option to > pass a predicate as you liked. > > > > On Fri, Oct 11, 2013 at 7:57 PM, Andrew Barnert wrote: > >> On Oct 11, 2013, at 19:48, David Mertz wrote: >> >> > My feeling, however, is that if one were to trim down the results from >> a permutations-related function, it is more interesting to me to only >> eliminate IDENTICAL items, not to eliminate merely EQUAL ones. 
>> >> I agree with the rest of your message, but I still think you're wrong >> here. Anyone who is surprised by distinct_permutations((3.0, 3)) treating >> the two values the same would be equally surprised by {3.0, 3} having only >> one member. Or by groupby((3.0, 'a'), (3, 'b')) only having one group. And >> so on. >> >> In Python, sets, dict keys, groups, etc. work by ==. That was a choice >> that could have been made differently, but Python made that choice long >> ago, and has applied it completely consistently, and it would be very >> strange to choose differently in this case. > > > > > -- > Keeping medicines from the bloodstreams of the sick; food > from the bellies of the hungry; books from the hands of the > uneducated; technology from the underdeveloped; and putting > advocates of freedom in prisons. Intellectual property is > to the 21st century what the slave trade was to the 16th. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mertz at gnosis.cx Sat Oct 12 09:09:32 2013 From: mertz at gnosis.cx (David Mertz) Date: Sat, 12 Oct 2013 00:09:32 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <6123349B-FFCB-42FE-B973-B3C8251302C4@yahoo.com> Message-ID: On Sat, Oct 12, 2013 at 12:02 AM, Neil Girdhar wrote: > Why not just use the standard python way to generalize this: "key" rather > than the nonstandard "filter_by". > Yes, 'key' is a much better name than what I suggested. I'm not quite sure how best to implement this still. I guess MRAB's recursive approach should work, even though I like the simplicity of my style that takes full advantage of the existing itertools.permutations() (and uses 1/3 as many lines of--I think clearer--code). His has the advantage, however, that it doesn't require operator.lt() to work... however, without benchmarking, I have a pretty strong feeling that my suggestion will be faster since it avoids all that recursive call overhead. Maybe I'm wrong about that though. > On Sat, Oct 12, 2013 at 1:38 AM, David Mertz wrote: > >> Hi Andrew, >> >> I've sort of said as much in my last reply to Nick. But maybe I can >> clarify further. I can imagine *someone* wanting a filtering of >> permutations by either identify or equality. Maybe, in fact, by other >> comparisons also for generality. >> >> This might suggest an API like the following: >> >> equal_perms = distinct_permutations(items, r, filter_by=operator.eq) >> ident_perms = distinct_permutations(items, r, filter_by=operator.is_) >> >> Or even perhaps, in some use-case that isn't clear to me, e.g. >> >> start_same_perms = distinct_permutations(items, r, filter_by=lambda >> a,b: a[0]==b[0]) >> >> Or perhaps more plausibly, some predicate that, e.g. tests if two >> returned tuples are the same under case normalization of the strings within >> them. >> >> I guess the argument then would be what the default value of 'filter_by' >> might be... but that seems less important to me if there were an option to >> pass a predicate as you liked. >> >> >> >> On Fri, Oct 11, 2013 at 7:57 PM, Andrew Barnert wrote: >> >>> On Oct 11, 2013, at 19:48, David Mertz wrote: >>> >>> > My feeling, however, is that if one were to trim down the results from >>> a permutations-related function, it is more interesting to me to only >>> eliminate IDENTICAL items, not to eliminate merely EQUAL ones. 
>>> >>> I agree with the rest of your message, but I still think you're wrong >>> here. Anyone who is surprised by distinct_permutations((3.0, 3)) treating >>> the two values the same would be equally surprised by {3.0, 3} having only >>> one member. Or by groupby((3.0, 'a'), (3, 'b')) only having one group. And >>> so on. >>> >>> In Python, sets, dict keys, groups, etc. work by ==. That was a choice >>> that could have been made differently, but Python made that choice long >>> ago, and has applied it completely consistently, and it would be very >>> strange to choose differently in this case. >> >> >> >> >> -- >> Keeping medicines from the bloodstreams of the sick; food >> from the bellies of the hungry; books from the hands of the >> uneducated; technology from the underdeveloped; and putting >> advocates of freedom in prisons. Intellectual property is >> to the 21st century what the slave trade was to the 16th. >> > > -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Sat Oct 12 09:17:43 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Sat, 12 Oct 2013 03:17:43 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <20131012063445.GI7989@ando> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> Message-ID: I'm sorry, but I can't find a reference supporting the statement that the current permutations function is consistent with the mathematical definition. Perhaps you would like to find a reference? A quick search yielded the book "the Combinatorics of Permutations": http://books.google.ca/books?id=Op-nF-mBR7YC&lpg=PP1 Please look in the chapter "Permutation of multisets". Best, Neil On Sat, Oct 12, 2013 at 2:34 AM, Steven D'Aprano wrote: > On Fri, Oct 11, 2013 at 10:55:06PM -0400, Neil Girdhar wrote: > > I honestly think that Python should stick to the mathematical definition > of > > permutations rather than some kind of consensus of the tiny minority of > > people here. > > So do I. And that is exactly what itertools.permutations already does. > > > > -- > Steven > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- > > --- > You received this message because you are subscribed to a topic in the > Google Groups "python-ideas" group. > To unsubscribe from this topic, visit > https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. > To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe at googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. > -------------- next part -------------- An HTML attachment was scrubbed... 
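[Sketch for reference, not a message from the thread: one way MRAB's recursive approach could grow the key= parameter discussed above; the names and signature are illustrative. Keys must be hashable, and the default key deduplicates by equality, matching dict/set semantics.]

def unique_permutations(iterable, r=None, key=None):
    items = list(iterable)
    if r is None:
        r = len(items)
    if key is None:
        key = lambda item: item   # default: deduplicate by equality

    def perm(pool, r):
        # Recursively build r-tuples, skipping any item whose key has
        # already been tried at the current position.
        if r == 0:
            yield ()
            return
        seen = set()
        for i, item in enumerate(pool):
            k = key(item)
            if k in seen:
                continue
            seen.add(k)
            for rest in perm(pool[:i] + pool[i + 1:], r - 1):
                yield (item,) + rest

    yield from perm(items, r)

# e.g. list(unique_permutations('aab'))
# -> [('a', 'a', 'b'), ('a', 'b', 'a'), ('b', 'a', 'a')]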
URL: From steve at pearwood.info Sat Oct 12 09:35:31 2013 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 12 Oct 2013 18:35:31 +1100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> Message-ID: <20131012073531.GJ7989@ando> On Fri, Oct 11, 2013 at 10:37:02PM -0400, Neil Girdhar wrote: > I think it's pretty indisputable that permutations are formally defined > this way (and I challenge you to find a source that doesn't agree with > that). If by "this way" you mean "unique permutations only", then yes it *completely* disputable, and I am doing so right now. I'm not arguing one way or the other for a separate "unique_permutations" generator, just that the existing permutations generator does the right thing. If you're satisfied with that answer, you can stop reading now, because the rest of my post is going to be rather long: TL;DR: If you want a unique_permutations generator, that's a reasonable request. If you insist on changing permutations, that's unreasonable, firstly because the current behaviour is correct, and secondly because backwards compatibility would constrain it to keep the existing behaviour even if it were wrong. . . . Still here? Okay then, let me justify why I say the current behaviour is correct. Speaking as a math tutor who has taught High School level combinatorics for 20+ years, I've never come across any text book or source that defines permutations in terms of unique permutations only. In every case that I can remember, or that I still have access to, unique permutations is considered a different kind of operation ("permutations ignoring duplicates", if you like) rather than the default. E.g. "Modern Mathematics 6" by Fitzpatrick and Galbraith has a separate section for permutations with repetition, gives the example of taking permutations from the word "MAMMAL", and explicitly contrasts situations where you consider the three letters M as "different" from when you consider them "the same". But in all such cases, such a situation is discussed as a restriction on permutations, not an expansion, that is: * there are permutations; * sometimes you want to only consider unique permutations; rather than: * there are permutations, which are always unique; * sometimes you want to consider things which are like permutations except they're not necessarily unique. I'd even turn this around and challenge you to find a source that *does* define them as always unique. Here's a typical example, from the Collins Dictionary of Mathematics: [quote] **permutation** or **ordered arrangement** n. 1 an ordered arrangement of a specified number of objects selected from a set. The number of distinct permutations of r objects from n is n!/(n-r)! usually written n P r or n P r. For example there are six distinct permutations of two objects selected out of three: <1,2>, <1,3>, <2,1>, <2,3>, <3,1>, <3,2>. Compare COMBINATION. 2. any rearrangement of all the elements of a finite sequence, such as (1,3,2) and (3,1,2). It is *odd* or *even* according as the number of exchanges of position yielding it from the original order is odd or even. It is a *cyclic permutation* if it merely advances all the elements a fixed number of places; that is, if it is a CYCLE of maximal LENGTH. A *transposition* is a cycle of degree two, and all permutations factor as products of transpositions. See also SIGNATURE. 3. 
any BIJECTION of a set to itself, where the set may be finite or infinite. [end quote] The definition makes no comment about how to handle duplicate elements, but we can derive an answer for that: 1) We're told how many permutations there are. Picking r elements out of n gives us n!/(n-r)!. If you throw away duplicate permutations, you will fall short. 2) The number of permutations shouldn't depend on the specific entities being permuted. Permutations of (1, 2, 3, 4) and (A, B, C, D) should be identical. If your set of elements contains duplicates, such as (Red ball, Red ball, Red ball, Black ball, Black ball), we can put the balls into 1:1 correspondence with integers (1, 2, 3, 4, 5), permute the integers, then reverse the mapping to get balls again. If we do this, we ought to get the same result as just permuting the balls directly. (That's not to say that there are never cases where we don't care to distinguish betweem one red ball and another. But in general we do distinguish between them.) I think this argument may hinge on what you consider *distinct*. In this context, if I permute the string "RRRBB", I consider all three characters to be distinct. Object identity is an implementation detail (not all programming languages have "objects"); even equality is an irrelevant detail. If I'm choosing to permute "RRRBB" rather than "RB", then clearly *to me* there must be some distinguishing factor between the three Rs and two Bs. Another source is Wolfram Mathworld: http://mathworld.wolfram.com/Permutation.html which likewise says nothing about discarding repeated permutations when there are repeated elements. See also their page on "Ball Picking": http://mathworld.wolfram.com/BallPicking.html Last but not least, here's a source which clearly distinguishes permutations from "permutations with duplicates": http://mathcentral.uregina.ca/QQ/database/QQ.09.07/h/beth3.html and even gives a distinct formula for calculating the number of permutations. Neither Wolfram Mathworld nor the Collins Dictionary of Maths consider this formula important enough to mention, which suggests strongly that it should be considered separate from the default permutations. (A little like cyclic permutations, which are different again.) -- Steven From steve at pearwood.info Sat Oct 12 09:39:30 2013 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 12 Oct 2013 18:39:30 +1100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <20131012073531.GJ7989@ando> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012073531.GJ7989@ando> Message-ID: <20131012073930.GK7989@ando> On Sat, Oct 12, 2013 at 06:35:31PM +1100, Steven D'Aprano wrote: > I think this argument may hinge on what you consider *distinct*. In this > context, if I permute the string "RRRBB", I consider all three > characters to be distinct. /s/three/five/ -- Steven From bauertomer at gmail.com Sat Oct 12 10:18:35 2013 From: bauertomer at gmail.com (TB) Date: Sat, 12 Oct 2013 11:18:35 +0300 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <20131012073531.GJ7989@ando> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012073531.GJ7989@ando> Message-ID: <525905DB.8070509@gmail.com> On 10/12/2013 10:35 AM, Steven D'Aprano wrote: > If you want a unique_permutations generator, that's a reasonable > request. 
If you insist on changing permutations, that's unreasonable, > firstly because the current behaviour is correct, and secondly because > backwards compatibility would constrain it to keep the existing > behaviour even if it were wrong. > I agree that backwards compatibility should be kept, but the current behaviour of itertools.permutations is (IMHO) surprising. So here are my 2c: Until I tried it myself, I was sure that it will be like the corresponding permutations functions in Sage: sage: list(Permutations("aba")) [['a', 'a', 'b'], ['a', 'b', 'a'], ['b', 'a', 'a']] or Mathematica: http://www.wolframalpha.com/input/?i=permutations+of+{a%2C+b%2C+a} Currently the docstring of itertools.permutations just says "Return successive r-length permutations of elements in the iterable", without telling what happens with input of repeated elements. The full doc in the reference manual is better in that regard, but I think at least one example with repeated elements would be nice. Regards, TB From mistersheik at gmail.com Sat Oct 12 10:20:24 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Sat, 12 Oct 2013 04:20:24 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <525905DB.8070509@gmail.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012073531.GJ7989@ando> <525905DB.8070509@gmail.com> Message-ID: +1 On Sat, Oct 12, 2013 at 4:18 AM, TB wrote: > On 10/12/2013 10:35 AM, Steven D'Aprano wrote: > >> If you want a unique_permutations generator, that's a reasonable >> request. If you insist on changing permutations, that's unreasonable, >> firstly because the current behaviour is correct, and secondly because >> backwards compatibility would constrain it to keep the existing >> behaviour even if it were wrong. >> >> I agree that backwards compatibility should be kept, but the current > behaviour of itertools.permutations is (IMHO) surprising. > > So here are my 2c: Until I tried it myself, I was sure that it will be > like the corresponding permutations functions in Sage: > > sage: list(Permutations("aba")) > [['a', 'a', 'b'], ['a', 'b', 'a'], ['b', 'a', 'a']] > > or Mathematica: http://www.wolframalpha.com/**input/?i=permutations+of+{a% > **2C+b%2C+a} > > Currently the docstring of itertools.permutations just says "Return > successive r-length permutations of elements in the iterable", without > telling what happens with input of repeated elements. The full doc in the > reference manual is better in that regard, but I think at least one example > with repeated elements would be nice. > > Regards, > TB > > ______________________________**_________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/**mailman/listinfo/python-ideas > > -- > > --- You received this message because you are subscribed to a topic in the > Google Groups "python-ideas" group. > To unsubscribe from this topic, visit https://groups.google.com/d/** > topic/python-ideas/**dDttJfkyu2k/unsubscribe > . > To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe@**googlegroups.com > . > For more options, visit https://groups.google.com/**groups/opt_out > . > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From abarnert at yahoo.com Sat Oct 12 10:22:59 2013 From: abarnert at yahoo.com (Andrew Barnert) Date: Sat, 12 Oct 2013 01:22:59 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> Message-ID: On Oct 11, 2013, at 23:55, Neil Girdhar wrote: > I think my best proposal is to have an efficient distinct_permutations function in itertools. It should be in itertools so that it is discoverable. It should be a function rather one of the recipes proposed to make it as efficient as possible. (Correct me if I'm wrong, but like the set solution, groupby is also not so efficient.) > > I welcome the discussion and hope that the most efficient implementation someone here comes up with will be added one day to itertools. I think getting something onto PyPI (whether as part of more-itertools or elsewhere) and/or the ActiveState recipes (and maybe StackOverflow and CodeReview) is the best way to get from here to there. Continuing to discuss it here, you've only got the half dozen or so people who are on this list and haven't tuned out this thread to come up with the most efficient implementation. Put it out in the world and people will begin giving you comments/bug reports/rants calling you an idiot for missing the obvious more efficient way to do it, and then you can use their code. And then, when you're satisfied with it, you have a concrete proposal for something to add to itertools in python X.Y+1 instead of some implementation to be named later to add one day. I was also going to suggest that you drop the argument about whether this is the one true definition of sequence permutation and just focus on whether it's a useful thing to have, but it looks like you're way ahead of me there, so never mind. From breamoreboy at yahoo.co.uk Sat Oct 12 10:28:55 2013 From: breamoreboy at yahoo.co.uk (Mark Lawrence) Date: Sat, 12 Oct 2013 09:28:55 +0100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <525905DB.8070509@gmail.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012073531.GJ7989@ando> <525905DB.8070509@gmail.com> Message-ID: On 12/10/2013 09:18, TB wrote: > Currently the docstring of itertools.permutations just says "Return > successive r-length permutations of elements in the iterable", without > telling what happens with input of repeated elements. The full doc in > the reference manual is better in that regard, but I think at least one > example with repeated elements would be nice. > > Regards, > TB I look forward to seeing your suggested doc patch on the Python bug tracker. -- Roses are red, Violets are blue, Most poems rhyme, But this one doesn't. Mark Lawrence From tjreedy at udel.edu Sat Oct 12 10:41:48 2013 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 12 Oct 2013 04:41:48 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <20131012073531.GJ7989@ando> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012073531.GJ7989@ando> Message-ID: On 10/12/2013 3:35 AM, Steven D'Aprano wrote: > I'd even turn this around and challenge you to find a source that *does* > define them as always unique. Here's a typical example, from the Collins > Dictionary of Mathematics: > > > [quote] > **permutation** or **ordered arrangement** n. 1 an ordered arrangement > of a specified number of objects selected from a set. 
The number of > distinct permutations of r objects from n is > > n!/(n-r)! > > usually written nPr or ⁿPᵣ. For example there are six distinct permutations of two > objects selected out of three: <1,2>, <1,3>, <2,1>, <2,3>, <3,1>, <3,2>. > Compare COMBINATION. The items of a set are, by definition of a set, distinct, so the question of different but equal permutations does not arise. > 2. any rearrangement of all the elements of a finite sequence, such as > (1,3,2) and (3,1,2). It is *odd* or *even* according as the number of > exchanges of position yielding it from the original order is odd or > even. It is a *cyclic permutation* if it merely advances all the > elements a fixed number of places; that is, if it is a CYCLE of maximal > LENGTH. A *transposition* is a cycle of degree two, and all permutations > factor as products of transpositions. See also SIGNATURE. The items of a sequence may be duplicates. But in the treatments of permutations I have seen (admittedly not all of them), they are considered to be distinguished by position, so that one may replace the item by counts 1 to n and vice versa. > 3. any BIJECTION of a set to itself, where the set may be finite or > infinite. > [end quote] Back to a set of distinct items again. You are correct that itertools.permutations does the right thing by standard definition. > Last but not least, here's a source which clearly distinguishes > permutations from "permutations with duplicates": > > http://mathcentral.uregina.ca/QQ/database/QQ.09.07/h/beth3.html > > and even gives a distinct formula for calculating the number of > permutations. Neither Wolfram Mathworld nor the Collins Dictionary of > Maths consider this formula important enough to mention, which suggests > strongly that it should be considered separate from the default > permutations. The question is whether this particular variation is important enough to put in itertools. It is not a combinatorics module and did not start with permutations. -- Terry Jan Reedy From ncoghlan at gmail.com Sat Oct 12 17:07:58 2013 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 13 Oct 2013 01:07:58 +1000 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> Message-ID: On 12 Oct 2013 17:18, "Neil Girdhar" wrote: > > I'm sorry, but I can't find a reference supporting the statement that > the current permutations function is consistent with the mathematical > definition. Perhaps you would like to find a reference? A quick search > yielded the book "the Combinatorics of Permutations": > http://books.google.ca/books?id=Op-nF-mBR7YC&lpg=PP1 Please look in the > chapter "Permutation of multisets". Itertools effectively produces the permutation of (index, value) pairs. Hence Steven's point about the permutations of a list not being mathematically defined, so you have to decide what set to map it to in order to decide what counts as a unique value. The mapping itertools uses considers position in the iterable relevant, so exchanging two values that are themselves equivalent is still considered a distinct permutation, since their original position is taken into account.
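To see that concretely (just an interactive illustration of the behaviour being described, nothing new):

>>> from itertools import permutations
>>> list(permutations('aba', 2))
[('a', 'b'), ('a', 'a'), ('b', 'a'), ('b', 'a'), ('a', 'a'), ('a', 'b')]

The two 'a' values come from different positions, so swapping them still counts as a separate permutation, which is why every result appears twice.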
Like a lot of mathematics, it's a matter of paying close attention to which entities are actually being manipulated and how the equivalence classes are being defined :) Hence the current proposal amounts to adding another variant that provides the permutations of an unordered multiset instead of those of a set of (index, value) 2-tuples (with the indices stripped from the results). One interesting point is that combining collections.Counter.elements() with itertools.permutations() currently does the wrong thing, since itertools.permutations() *always* considers iterable order significant, while for collections.Counter.elements() it's explicitly arbitrary. Cheers, Nick. > > Best, > > Neil > > > On Sat, Oct 12, 2013 at 2:34 AM, Steven D'Aprano wrote: >> >> On Fri, Oct 11, 2013 at 10:55:06PM -0400, Neil Girdhar wrote: >> > I honestly think that Python should stick to the mathematical definition of >> > permutations rather than some kind of consensus of the tiny minority of >> > people here. >> >> So do I. And that is exactly what itertools.permutations already does. >> >> >> >> -- >> Steven >> _______________________________________________ >> Python-ideas mailing list >> Python-ideas at python.org >> https://mail.python.org/mailman/listinfo/python-ideas >> >> -- >> >> --- >> You received this message because you are subscribed to a topic in the Google Groups "python-ideas" group. >> To unsubscribe from this topic, visit https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. >> To unsubscribe from this group and all its topics, send an email to python-ideas+unsubscribe at googlegroups.com. >> For more options, visit https://groups.google.com/groups/opt_out. > > > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > -------------- next part -------------- An HTML attachment was scrubbed... URL: From python at mrabarnett.plus.com Sat Oct 12 18:55:31 2013 From: python at mrabarnett.plus.com (MRAB) Date: Sat, 12 Oct 2013 17:55:31 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> Message-ID: <52597F03.90509@mrabarnett.plus.com> On 12/10/2013 03:36, David Mertz wrote: > Hi MRAB, > > I'm confused by your implementation. In particular, what do these lines do? > > # [...] > items = list(iterable) > keys = {} > for item in items: > keys.setdefault(item, len(keys)) > items.sort(key=keys.get) > > I cannot understand how these can possibly have any effect (other than > the first line that makes a concrete list out of an iterable). > > We loop through the list in its natural order. E.g. say the list is > '[a, b, c]' (where those names are any types of objects whatsoever). > The loop gives us: > > keys == {a:0, b:1, c:2} > > When we do a sort on 'key=keys.get()' how can that ever possibly change > the order of 'items'? > You're assuming that no item is equal to any other. Try this: keys = {} for item in [1, 2, 2.0]: keys.setdefault(item, len(keys)) You'll get: keys == {1: 0, 2: 1} because 2 == 2.0. 
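Following that through with a slightly longer input shows what the sort is for (a throwaway illustration, not part of the implementation itself):

items = [2.0, 1, 2, 1.0]
keys = {}
for item in items:
    keys.setdefault(item, len(keys))
items.sort(key=keys.get)
print(items)   # [2.0, 2, 1, 1.0]

Equal items end up adjacent, ordered by first appearance, which is the property the rest of the implementation relies on to skip duplicates.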
> There's also a bit of a flaw in that your implementation blows up if > anything yielded by iterable isn't hashable: > > >>> list(unique_permutations([ [1,2],[3,4],[5,6] ])) > Traceback (most recent call last): > File "", line 1, in > TypeError: unhashable type: 'list' > That is true, so here is yet another implementation: ----8<----------------------------------------8<---- def distinct_permutations(iterable, count=None): def perm(items, count): if count: prev_item = object() for i, item in enumerate(items): if item != prev_item: for p in perm(items[ : i] + items[i + 1 : ], count - 1): yield [item] + p prev_item = item else: yield [] hashable_items = {} unhashable_items = [] for item in iterable: try: hashable_items[item].append(item) except KeyError: hashable_items[item] = [item] except TypeError: for key, values in unhashable_items: if key == item: values.append(item) break else: unhashable_items.append((item, [item])) items = [] for values in hashable_items.values(): items.extend(values) for key, values in unhashable_items: items.extend(values) if count is None: count = len(items) yield from perm(items, count) ----8<----------------------------------------8<---- It uses a dict for speed, with the fallback of a list for unhashable items. > > >>> list(permutations([[1,2],[3,4],[5,6]])) > [([1, 2], [3, 4], [5, 6]), ([1, 2], [5, 6], [3, 4]), ([3, 4], [1, > 2], [5, 6]), > ([3, 4], [5, 6], [1, 2]), ([5, 6], [1, 2], [3, 4]), ([5, 6], [3, > 4], [1, 2])] > > This particular one also succeeds with my nonredundant_permutations: > > >>> list(nonredundant_permutations([[1,2],[3,4],[5,6]])) > [([1, 2], [3, 4], [5, 6]), ([1, 2], [5, 6], [3, 4]), ([3, 4], [1, > 2], [5, 6]), > ([3, 4], [5, 6], [1, 2]), ([5, 6], [1, 2], [3, 4]), ([5, 6], [3, > 4], [1, 2])] > My result is: >>> list(distinct_permutations([[1,2],[3,4],[5,6]])) [[[1, 2], [3, 4], [5, 6]], [[1, 2], [5, 6], [3, 4]], [[3, 4], [1, 2], [5, 6]], [[3, 4], [5, 6], [1, 2]], [[5, 6], [1, 2], [3, 4]], [[5, 6], [3, 4], [1, 2]]] > However, my version *DOES* fail when things cannot be compared under > inequality: > > >>> list(nonredundant_permutations([[1,2],3,4])) > Traceback (most recent call last): > File "", line 1, in > File "", line 3, in nonredundant_permutations > TypeError: unorderable types: int() < list() > > This also doesn't afflict itertools.permutations. > My result is: >>> list(distinct_permutations([[1,2],3,4])) [[3, 4, [1, 2]], [3, [1, 2], 4], [4, 3, [1, 2]], [4, [1, 2], 3], [[1, 2], 3, 4], [[1, 2], 4, 3]] From mertz at gnosis.cx Sat Oct 12 18:56:13 2013 From: mertz at gnosis.cx (David Mertz) Date: Sat, 12 Oct 2013 09:56:13 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> Message-ID: On Sat, Oct 12, 2013 at 8:07 AM, Nick Coghlan wrote: > One interesting point is that combining collections.Counter.elements() > with itertools.permutations() currently does the wrong thing, since > itertools.permutations() *always* considers iterable order significant, > while for collections.Counter.elements() it's explicitly arbitrary. 
> I hadn't thought about it, but as I read the docs for 3.4 (and it's the same back through 2.7), not only would both of these be permissible in a Python implementation: >>> list(collections.Counter({'a':2,'b':1}).elements()) ['a', 'a', 'b'] Or: >>> list(collections.Counter({'a':2,'b':1}).elements()) ['b', 'a', 'a'] But even this would be per documentation (although really unlikely as an implementation): >>> list(collections.Counter({'a':2,'b':1}).elements()) ['a', 'b', 'a'] -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. -------------- next part -------------- An HTML attachment was scrubbed... URL: From raymond.hettinger at gmail.com Sat Oct 12 19:34:26 2013 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Sat, 12 Oct 2013 10:34:26 -0700 Subject: [Python-ideas] An exhaust() function for iterators In-Reply-To: References: Message-ID: <1657273B-685C-4335-A9E4-5DF5775DE620@gmail.com> On Sep 28, 2013, at 9:06 PM, Clay Sweetser wrote: > > As it turns out, the fastest and most efficient method available in > the standard library is collections.deque's __init__ and extend > methods. That technique is shown in the itertools docs in the consume() recipe. It is the fastest way in CPython (in PyPy, a straight for-loop will likely be the fastest). I didn't immortalize it as a real itertool because I think most code is better-off with a straight for-loop. The itertools were inspired by functional languages and intended to be used in a functional style where iterators with side-effects would be considered bad form. A regular for-loop is only a little bit slower, but it has a number of virtues including clarity, signal checking, and thread switching. In a real application, the speed difference of consume() vs a for-loop is likely to be insignificant if the iterator is doing anything interesting at all. Raymond -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Sat Oct 12 20:56:55 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Sat, 12 Oct 2013 14:56:55 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> Message-ID: Yes, you're right and I understand what's been done although like the 30 upvoters to the linked stackoverflow question, I find the current behaviour surprising and would like to see a distinct_permutations function. How do I start to submit a patch? Neil On Sat, Oct 12, 2013 at 11:07 AM, Nick Coghlan wrote: > > On 12 Oct 2013 17:18, "Neil Girdhar" wrote: > > > > I'm sorry, but I can't find a reference supporting the statement that > the current permutations function is consistent with the mathematical > definition. Perhaps you would like to find a reference? A quick search > yielded the book "the Combinatorics of Permutations": > http://books.google.ca/books?id=Op-nF-mBR7YC&lpg=PP1 Please look in the > chapter "Permutation of multisets". > > Itertools effectively produces the permutation of (index, value) pairs. > Hence Steven's point about the permutations of a list not being > mathematically defined, so you have to decide what set to map it to in > order to decide what counts as a unique value. 
The mapping itertools uses > considers position in the iterable relevant so exchanging two values that > are themselves equivalent is still considered a distinct permutation since > their original position is taken into account. Like a lot of mathematics, > it's a matter of paying close attention to which entities are actually > being manipulated and how the equivalence classes are being defined :) > > Hence the current proposal amounts to adding another variant that provides > the permutations of an unordered multiset instead of those of a set of > (index, value) 2-tuples (with the indices stripped from the results). > > One interesting point is that combining collections.Counter.elements() > with itertools.permutations() currently does the wrong thing, since > itertools.permutations() *always* considers iterable order significant, > while for collections.Counter.elements() it's explicitly arbitrary. > > Cheers, > Nick. > > > > > Best, > > > > Neil > > > > > > On Sat, Oct 12, 2013 at 2:34 AM, Steven D'Aprano > wrote: > >> > >> On Fri, Oct 11, 2013 at 10:55:06PM -0400, Neil Girdhar wrote: > >> > I honestly think that Python should stick to the mathematical > definition of > >> > permutations rather than some kind of consensus of the tiny minority > of > >> > people here. > >> > >> So do I. And that is exactly what itertools.permutations already does. > >> > >> > >> > >> -- > >> Steven > >> _______________________________________________ > >> Python-ideas mailing list > >> Python-ideas at python.org > >> https://mail.python.org/mailman/listinfo/python-ideas > >> > >> -- > >> > >> --- > >> You received this message because you are subscribed to a topic in the > Google Groups "python-ideas" group. > >> To unsubscribe from this topic, visit > https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. > >> To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe at googlegroups.com. > >> For more options, visit https://groups.google.com/groups/opt_out. > > > > > > > > _______________________________________________ > > Python-ideas mailing list > > Python-ideas at python.org > > https://mail.python.org/mailman/listinfo/python-ideas > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From raymond.hettinger at gmail.com Sun Oct 13 02:44:38 2013 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Sat, 12 Oct 2013 17:44:38 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> Message-ID: On Oct 12, 2013, at 11:56 AM, Neil Girdhar wrote: > , I find the current behaviour surprising and would like to see a distinct_permutations function. > How do I start to submit a patch? You can submit your patch at http://bugs.python.org and assign it to me (the module designer and maintainer). That said, the odds of it being accepted are slim. There are many ways to write combinatoric functions (Knuth has a whole book on the subject) and I don't aspire to include multiple variants unless there are strong motivating use cases. 
In general, if someone wants to eliminate duplicates from the population, they can do so easily with: permutations(set(population), n) The current design solves the most common use cases and it has some nice properties such as: * permutations is a subsequence of product * no assumptions are made about the comparability or orderability of members of the population * len(list(permutations(range(n), r))) == n! / (n-r)! just like you were taught in school * it is fast For more exotic needs, I think it is appropriate to look outside the standard library to more full-featured combinatoric libraries (there are several listed at pypi.python.org). Raymond -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Sun Oct 13 03:24:36 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Sat, 12 Oct 2013 21:24:36 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> Message-ID: Hi Raymond, I agree with you on the consistency point with itertools.product. That's a great point. However, permutations(set(population)) is not the correct way to take the permutations of a multiset. Please take a look at how permutations are taken from a multiset in any of the papers I linked or any paper that you can find on the internet. The number of permutations of a multiset is n! / \prod a_i!, where the a_i are the element counts -- just like I was taught in school. There is currently no fast way to find these permutations of a multiset and it is a common operation for solving problems. What is needed, I think, is a function multiset_permutations that accepts an iterable. Best, Neil On Sat, Oct 12, 2013 at 8:44 PM, Raymond Hettinger < raymond.hettinger at gmail.com> wrote: > > On Oct 12, 2013, at 11:56 AM, Neil Girdhar wrote: > > , I find the current behaviour surprising and would like to see a > distinct_permutations function. > > How do I start to submit a patch? > > > You can submit your patch at http://bugs.python.org and assign it to me > (the module designer and maintainer). > > That said, the odds of it being accepted are slim. > There are many ways to write combinatoric functions > (Knuth has a whole book on the subject) and I don't > aspire to include multiple variants unless there are > strong motivating use cases. > > In general, if someone wants to eliminate duplicates > from the population, they can do so easily with: > > permutations(set(population), n) > > The current design solves the most common use cases > and it has some nice properties such as: > * permutations is a subsequence of product > * no assumptions are made about the comparability > or orderability of members of the population > * len(list(permutations(range(n), r))) == n! / (n-r)! > just like you were taught in school > * it is fast > > For more exotic needs, I think it is appropriate to look > outside the standard library to more full-featured > combinatoric libraries (there are several listed at > pypi.python.org). > > > Raymond > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ethan at stoneleaf.us Sun Oct 13 03:11:18 2013 From: ethan at stoneleaf.us (Ethan Furman) Date: Sat, 12 Oct 2013 18:11:18 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> Message-ID: <5259F336.9070203@stoneleaf.us> On 10/12/2013 05:44 PM, Raymond Hettinger wrote: > > On Oct 12, 2013, at 11:56 AM, Neil Girdhar > wrote: > >> , I find the current behaviour surprising and would like to see a distinct_permutations function. >> How do I start to submit a patch? > > You can submit your patch at http://bugs.python.org and assign it to me (the module designer and maintainer). > > That said, the odds of it being accepted are slim. +1 About the only improvement I can see would be a footnote in the itertools doc table that lists the different combinatorics. Being a naive permutations user myself I would have made the mistake of thinking that "r-length tuples, all possible orderings, no repeated elements" meant no repeated values. The longer text for permutations makes it clear how it works. My rst-foo is not good enough to link from the table down into the permutation text where the distinction is made clear. If no one beats me to a proposed patch I'll see if I can figure it out. -- ~Ethan~ From steve at pearwood.info Sun Oct 13 03:47:42 2013 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 13 Oct 2013 12:47:42 +1100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> Message-ID: <20131013014742.GR7989@ando> On Sat, Oct 12, 2013 at 05:44:38PM -0700, Raymond Hettinger wrote: > In general, if someone wants to eliminate duplicates > from the population, they can do so easily with: > > permutations(set(population), n) In fairness Raymond, the proposal is not to eliminate duplicates from the population, but from the permutations themselves. Consider the example I gave earlier, where you're permuting "RRRBB" two items at a time. There are 20 permutations including duplicates, but sixteen of them are repeated: py> list(''.join(t) for t in permutations("RRRBB", 2)) ['RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', 'RR', 'RR', 'RB', 'RB', 'BR', 'BR', 'BR', 'BB', 'BR', 'BR', 'BR', 'BB'] py> set(''.join(t) for t in permutations("RRRBB", 2)) {'BR', 'RB', 'RR', 'BB'} But if you eliminate duplicates from the population first, you get only two permutations: py> list(''.join(t) for t in permutations(set("RRRBB"), 2)) ['BR', 'RB'] If it were just a matter of calling set() on the output of permutations, that would be trivial enough. But, you might care about order, or elements might not be hashable, or you might have a LOT of permutations to generate before discarding: population = "R"*1000 + "B"*500 set(''.join(t) for t in permutations(population, 2)) # takes a while... In my opinion, if unique_permutations is no more efficient than calling set on the output of permutations, it's not worth it. But if somebody can come up with an implementation which is significantly more efficient, without making unreasonable assumptions about orderability, hashability or even comparability, then in my opinion that might be worthwhile. 
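Just to give that a concrete shape, here is one rough, untested sketch of the kind of thing I mean (note that it leans on hashability, which is exactly the sort of assumption I'd prefer to avoid, but it never builds a duplicate result in the first place):

from collections import Counter

def unique_permutations(iterable, r=None):
    # Recurse over element counts, so equal results are never generated twice.
    counts = Counter(iterable)
    n = sum(counts.values())
    r = n if r is None else r
    if r > n:
        return
    def gen(r):
        if r == 0:
            yield ()
            return
        for elem in counts:
            if counts[elem]:
                counts[elem] -= 1
                for rest in gen(r - 1):
                    yield (elem,) + rest
                counts[elem] += 1
    yield from gen(r)

py> sorted(''.join(t) for t in unique_permutations("RRRBB", 2))
['BB', 'BR', 'RB', 'RR']

Only the four distinct results are ever generated, instead of twenty that mostly get thrown away.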
-- Steven From raymond.hettinger at gmail.com Sun Oct 13 05:03:43 2013 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Sat, 12 Oct 2013 20:03:43 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <20131013014742.GR7989@ando> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: On Oct 12, 2013, at 6:47 PM, Steven D'Aprano wrote: > the proposal is not to eliminate duplicates from > the population, but from the permutations themselves. I'm curious about the use cases for this. Other than red/blue marble examples and some puzzle problems, does this come-up in any real problems? Do we actually need this? Raymond -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Sun Oct 13 09:38:40 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Sun, 13 Oct 2013 03:38:40 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: My intuition is that we want Python to be "complete". Many other languages can find the permutations of a multiset. Python has a permutations function. Many people on stackoverflow expected that function to be able to find those permutations. One suggestion: Why not make it so that itertools.permutations checks if its argument is an instance of collections.Mapping? If it is, we could interpret the items as a mapping from elements to positive integers, which is a compact representation of a multiset. Then, it could do the right thing for that case. Best, Neil On Sat, Oct 12, 2013 at 11:03 PM, Raymond Hettinger < raymond.hettinger at gmail.com> wrote: > > On Oct 12, 2013, at 6:47 PM, Steven D'Aprano wrote: > > the proposal is not to eliminate duplicates from > the population, but from the permutations themselves. > > > I'm curious about the use cases for this. > Other than red/blue marble examples and some puzzle problems, > does this come-up in any real problems? Do we actually need this? > > > Raymond > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- > > --- > You received this message because you are subscribed to a topic in the > Google Groups "python-ideas" group. > To unsubscribe from this topic, visit > https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. > To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe at googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Sun Oct 13 11:27:54 2013 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 13 Oct 2013 19:27:54 +1000 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: On 13 October 2013 17:38, Neil Girdhar wrote: > My intuition is that we want Python to be "complete". Many other languages > can find the permutations of a multiset. Python has a permutations > function. 
Many people on stackoverflow expected that function to be able to > find those permutations. Nope, we expressly *don't* want the standard library to be "complete", because that would mean growing to the size of PyPI (or larger). There's always going to be scope for applications to adopt new domain specific dependencies with more in-depth support than that provided by the standard library. Many standard library modules are in fact deliberately designed as "stepping stone" modules that will meet the needs of code which have an incidental relationship to that task, but will need to be replaced with something more sophisticated for code directly related to that domain. Many times, that means they will ignore as irrelevant distinctions that are critical in certain contexts, simply because they don't come up all that often outside those specific domains, and addressing them involves making the core module more complicated to use for more typical cases. In this case, the proposed alternate permutations mechanism only makes a difference when: 1. The data set contains equivalent values 2. Input order is not considered significant, so exchanging equivalent values should *not* create a new permutation (i.e. multiset permutations rather than sequence permutations). If users aren't likely to encounter situations where that makes a difference, then providing both in the standard library isn't being helpful, it's being actively user hostile by asking them to make a decision they're not yet qualified to make for the sake of the few experts that specifically need . Hence Raymond's request for data modelling problems outside the "learning or studying combinatorics" context to make the case for standard library inclusion. Interestingly, I just found another language which has the equivalent of the currrent behaviour of itertools.permutations: Haskell has it as Data.List.permutations. As far as I can tell, Haskell doesn't offer support for multiset permutations in the core, you need an additional package like Math.Combinatorics (see: http://hackage.haskell.org/package/multiset-comb-0.2.3/docs/Math-Combinatorics-Multiset.html#g:4). Since iterator based programming in Python is heavily inspired by Haskell, this suggests that the current behaviour of itertools.permutations is appropriate and that Raymond is right to be dubious about including multiset permutations support directly in the standard library. Those interested in improving the experience of writing combinatorics code in Python may wish to look into helping out with the combinatorics package on PyPI: http://phillipmfeldman.org/Python/for_developers.html (For example, politely approach Phillip to see if he is interested in hosting it on GitHub or BitBucket, providing Sphinx docs on ReadTheDocs, improving the PyPI metadata, etc - note I have no experience with this package, it's just the first hit for "python combinatorics") > One suggestion: Why not make it so that itertools.permutations checks if its > argument is an instance of collections.Mapping? If it is, we could > interpret the items as a mapping from elements to positive integers, which > is a compact representation of a multiset. Then, it could do the right > thing for that case. 
If you want to go down the path of only caring about hashable values, you may want to argue for a permutations method on collections.Counter (it's conceivable that approach has the potential to be even faster than an approach based on accepting and processing an arbitrary iterable, since it can avoid generating repeated values in the first place). A Counter based multiset permutation algorithm was actually posted to python-list back in 2009, just after collections.Counter was introduced: https://mail.python.org/pipermail/python-list/2009-January/521685.html I just created an updated version of that recipe and posted it as https://bitbucket.org/ncoghlan/misc/src/default/multiset_permutations.py Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From breamoreboy at yahoo.co.uk Sun Oct 13 13:05:30 2013 From: breamoreboy at yahoo.co.uk (Mark Lawrence) Date: Sun, 13 Oct 2013 12:05:30 +0100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: On 13/10/2013 08:38, Neil Girdhar wrote: > My intuition is that we want Python to be "complete". No thank you. I much prefer "Python in a Nutshell" the size it is now, I'm not interested in competing with (say) "Java in a Nutshell". -- Roses are red, Violets are blue, Most poems rhyme, But this one doesn't. Mark Lawrence From oscar.j.benjamin at gmail.com Sun Oct 13 17:54:16 2013 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Sun, 13 Oct 2013 16:54:16 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: On 11 October 2013 22:38, Neil Girdhar wrote: > My code, which was the motivation for this suggestion: > > import itertools as it > import math > > def is_prime(n): > for i in range(2, int(math.floor(math.sqrt(n))) + 1): > if n % i == 0: > return False > return n >= 2 I don't really understand what your code is doing but I just wanted to point out that the above will fail for large integers (maybe not relevant in your case): >>> is_prime(2**19937-1) Traceback (most recent call last): File "", line 1, in File "tmp.py", line 3, in is_prime for i in range(2, int(math.floor(math.sqrt(n))) + 1): OverflowError: long int too large to convert to float Even without the OverflowError I suspect that there are primes p > ~1e16 such that is_prime(p**2) would incorrectly return True. This is a consequence of depending on FP arithmetic in what should be exact computation. The easy fix is to break when i**2 > n avoiding the tricky sqrt operation. Alternatively you can use an exact integer sqrt function to fix this: def sqrt_floor(y): try: x = int(math.sqrt(y)) except OverflowError: x = y while not (x ** 2 <= y < (x+1) ** 2): x = (x + y // x) // 2 return x Oscar From mistersheik at gmail.com Sun Oct 13 20:29:38 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Sun, 13 Oct 2013 14:29:38 -0400 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: Did you read the problem? Anyway, let's not get off topic (permutations). 
Neil On Sun, Oct 13, 2013 at 11:54 AM, Oscar Benjamin wrote: > On 11 October 2013 22:38, Neil Girdhar wrote: > > My code, which was the motivation for this suggestion: > > > > import itertools as it > > import math > > > > def is_prime(n): > > for i in range(2, int(math.floor(math.sqrt(n))) + 1): > > if n % i == 0: > > return False > > return n >= 2 > > I don't really understand what your code is doing but I just wanted to > point out that the above will fail for large integers (maybe not > relevant in your case): > > >>> is_prime(2**19937-1) > Traceback (most recent call last): > File "", line 1, in > File "tmp.py", line 3, in is_prime > for i in range(2, int(math.floor(math.sqrt(n))) + 1): > OverflowError: long int too large to convert to float > > Even without the OverflowError I suspect that there are primes p > > ~1e16 such that is_prime(p**2) would incorrectly return True. This is > a consequence of depending on FP arithmetic in what should be exact > computation. The easy fix is to break when i**2 > n avoiding the > tricky sqrt operation. Alternatively you can use an exact integer sqrt > function to fix this: > > def sqrt_floor(y): > try: > x = int(math.sqrt(y)) > except OverflowError: > x = y > while not (x ** 2 <= y < (x+1) ** 2): > x = (x + y // x) // 2 > return x > > > Oscar > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tim.peters at gmail.com Sun Oct 13 21:02:56 2013 From: tim.peters at gmail.com (Tim Peters) Date: Sun, 13 Oct 2013 14:02:56 -0500 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: <5258B539.10307@mrabarnett.plus.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> <5258B539.10307@mrabarnett.plus.com> Message-ID: [MRAB, posts a beautiful solution] I don't really have a use for this, but it was a lovely programming puzzle, so I'll include an elaborate elaboration of MRAB's algorithm below. And that's the end of my interest in this ;-) It doesn't require that elements be orderable or even hashable. It does require that they can be compared for equality, but it's pretty clear that if we _do_ include something like this, "equality" has to be pluggable. By default, this uses `operator.__eq__`, but any 2-argument function can be used. E.g., use `operator.is_` to make it believe that only identical objects are equal. Or pass a lambda to distinguish by type too (e.g., if you don't want 3 and 3.0 to be considered equal). Etc. The code is much lower-level, to make it closer to an efficient C implementation. No dicts, no sets, no slicing or concatenation of lists, etc. It sticks to using little integers (indices) as far as possible, which can be native in C (avoiding mounds of increfs and decrefs). Also, because "equality" is pluggable, it may be a slow operation. The `equal()` function is only called here during initial setup, to partition the elements into equivalence classes. Where N is len(iterables), at best `equal()` is called N-1 times (if all elements happen to be equal), and at worst N*(N-1)/2 times (if no elements happen to be equal), all independent of `count`. It assumes `equal()` is transitive. It doesn't always return permutations in the same order as MRAB's function, because - to avoid any searching - it iterates over equivalence classes instead of over the original iterables. 
This is the simplest differing example I can think of: >>> list(unique_permutations("aba", 2)) [('a', 'b'), ('a', 'a'), ('b', 'a')] For the first result, MRAB's function first picks the first 'a', then removes it from the iterables and recurses on ("ba", 1). So it finds 'b' next, and yields ('a', 'b') (note: this is the modified unique_permutations() below - MRAB's original actually yielded lists, not tuples). But: >>> list(up("aba", 2)) [('a', 'a'), ('a', 'b'), ('b', 'a')] Different order! That's because "up" is iterating over (conceptually) [EquivClass(first 'a', second 'a'), EquivClass('b')] It first picks the first `a`, then adjusts list pointers (always a fast, constant-time operation) so that it recurses on [EquivClass(second 'a'), EquivClass('b')] So it next finds the second 'a', and yields (first 'a', second 'a') as its first result. Maybe this will make it clearer: >>> list(up(["a1", "b", "a2"], 2, lambda x, y: x[0]==y[0])) [('a1', 'a2'), ('a1', 'b'), ('b', 'a1')] No, I guess that didn't make it clearer - LOL ;-) Do I care? No. Anyway, here's the code. Have fun :-) # MRAB's beautiful solution, modified in two ways to be # more like itertools.permutations: # 1. Yield tuples instead of lists. # 2. When count > len(iterable), don't yield anything. def unique_permutations(iterable, count=None): def perm(items, count): if count: seen = set() for i, item in enumerate(items): if item not in seen: for p in perm(items[:i] + items[i+1:], count - 1): yield [item] + p seen.add(item) else: yield [] items = list(iterable) if count is None: count = len(items) if count > len(items): return for p in perm(items, count): yield tuple(p) # New code, ending in generator `up()`. import operator # In C, this would be a struct of native C types, # and the brief methods would be coded inline. class ENode: def __init__(self, initial_index=None): self.indices = [initial_index] # list of equivalent indices self.current = 0 self.prev = self.next = self def index(self): "Return current index." return self.indices[self.current] def unlink(self): "Remove self from list." self.prev.next = self.next self.next.prev = self.prev def insert_after(self, x): "Insert node x after self." x.prev = self x.next = self.next self.next.prev = x self.next = x def advance(self): """Advance the current index. If we're already at the end, remove self from list. .restore() undoes everything .advance() did.""" assert self.current < len(self.indices) self.current += 1 if self.current == len(self.indices): self.unlink() def restore(self): "Undo what .advance() did." assert self.current <= len(self.indices) if self.current == len(self.indices): self.prev.insert_after(self) self.current -= 1 def build_equivalence_classes(items, equal): ehead = ENode() # headed, doubly-linked circular list of equiv classes for i, elt in enumerate(items): e = ehead.next while e is not ehead: if equal(elt, items[e.indices[0]]): # Add (index of) elt to this equivalence class. e.indices.append(i) break e = e.next else: # elt not equal to anything seen so far: append # new equivalence class. 
e = ENode(i) ehead.prev.insert_after(e) return ehead def up(iterable, count=None, equal=operator.__eq__): def perm(i): if i: e = ehead.next assert e is not ehead while e is not ehead: result[count - i] = e.index() e.advance() yield from perm(i-1) e.restore() e = e.next else: yield tuple(items[j] for j in result) items = tuple(iterable) if count is None: count = len(items) if count > len(items): return ehead = build_equivalence_classes(items, equal) result = [None] * count yield from perm(count) From python at mrabarnett.plus.com Sun Oct 13 21:30:42 2013 From: python at mrabarnett.plus.com (MRAB) Date: Sun, 13 Oct 2013 20:30:42 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> <5258B539.10307@mrabarnett.plus.com> Message-ID: <525AF4E2.6010301@mrabarnett.plus.com> On 13/10/2013 20:02, Tim Peters wrote: > [MRAB, posts a beautiful solution] > > I don't really have a use for this, but it was a lovely programming > puzzle, so I'll include an elaborate elaboration of MRAB's algorithm > below. And that's the end of my interest in this ;-) > > It doesn't require that elements be orderable or even hashable. It > does require that they can be compared for equality, but it's pretty > clear that if we _do_ include something like this, "equality" has to > be pluggable. By default, this uses `operator.__eq__`, but any > 2-argument function can be used. E.g., use `operator.is_` to make it > believe that only identical objects are equal. Or pass a lambda to > distinguish by type too (e.g., if you don't want 3 and 3.0 to be > considered equal). Etc. > [snip] I posted yet another implementation after that one. From oscar.j.benjamin at gmail.com Sun Oct 13 21:34:09 2013 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Sun, 13 Oct 2013 20:34:09 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: On 13 October 2013 19:29, Neil Girdhar wrote: > Did you read the problem? I did but since you showed some code that you said you were working on I thought you'd be interested to know that it could be improved. > Anyway, let's not get off topic (permutations). Getting back to your proposal, I disagree that permutations should be "fixed". The current behaviour is correct. If I was asked to define a permutation I would have given definition #3 from Steven's list: a bijection from a set to itself. Formally a permutation of a collection of non-unique elements is not defined. They may also be uses for a function like the one that you proposed but I've never needed it (and I have used permutations a few times) and no one in this thread (including you) has given a use-case for this. Oscar From mistersheik at gmail.com Sun Oct 13 21:39:19 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Sun, 13 Oct 2013 15:39:19 -0400 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: On Sun, Oct 13, 2013 at 3:34 PM, Oscar Benjamin wrote: > On 13 October 2013 19:29, Neil Girdhar wrote: > > Did you read the problem? 
> > I did but since you showed some code that you said you were working on > I thought you'd be interested to know that it could be improved. > The code solves the problem according to its specification :) (The numbers are less than 1e8.) > > Anyway, let's not get off topic (permutations). > > Getting back to your proposal, I disagree that permutations should be > "fixed". The current behaviour is correct. If I was asked to define a > permutation I would have given definition #3 from Steven's list: a > bijection from a set to itself. Formally a permutation of a collection > of non-unique elements is not defined. > > They may also be uses for a function like the one that you proposed > but I've never needed it (and I have used permutations a few times) > and no one in this thread (including you) has given a use-case for > this. > > > Oscar > The problem is a use-case. Did you read it? Did you try solving it? -------------- next part -------------- An HTML attachment was scrubbed... URL: From python at mrabarnett.plus.com Sun Oct 13 22:04:21 2013 From: python at mrabarnett.plus.com (MRAB) Date: Sun, 13 Oct 2013 21:04:21 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> Message-ID: <525AFCC5.4070309@mrabarnett.plus.com> On 13/10/2013 20:34, Oscar Benjamin wrote: > On 13 October 2013 19:29, Neil Girdhar wrote: >> Did you read the problem? > > I did but since you showed some code that you said you were working on > I thought you'd be interested to know that it could be improved. > >> Anyway, let's not get off topic (permutations). > > Getting back to your proposal, I disagree that permutations should be > "fixed". The current behaviour is correct. If I was asked to define a > permutation I would have given definition #3 from Steven's list: a > bijection from a set to itself. Formally a permutation of a collection > of non-unique elements is not defined. > > They may also be uses for a function like the one that you proposed > but I've never needed it (and I have used permutations a few times) > and no one in this thread (including you) has given a use-case for > this. > Here's a use case: anagrams. From mistersheik at gmail.com Sun Oct 13 22:56:55 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Sun, 13 Oct 2013 16:56:55 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: Executive summary: Thanks for discussion everyone. I'm now convinced that itertools.permutations is fine as it is. I am not totally convinced that multiset_permutations doesn't belong in itertools, or else there should be a standard combinatorics library. On Sun, Oct 13, 2013 at 5:27 AM, Nick Coghlan wrote: > On 13 October 2013 17:38, Neil Girdhar wrote: > > My intuition is that we want Python to be "complete". Many other > languages > > can find the permutations of a multiset. Python has a permutations > > function. Many people on stackoverflow expected that function to be > able to > > find those permutations. > > Nope, we expressly *don't* want the standard library to be "complete", > because that would mean growing to the size of PyPI (or larger). 
> There's always going to be scope for applications to adopt new domain > specific dependencies with more in-depth support than that provided by > the standard library. > By complete I meant that just as if you were to add the "error function, erf" to math, you would want to add an equivalent version to cmath. When I saw the permutation function in itertools, I expected that it would work on both sets and multisets, or else there would be another function that did. > > Many standard library modules are in fact deliberately designed as > "stepping stone" modules that will meet the needs of code which have > an incidental relationship to that task, but will need to be replaced > with something more sophisticated for code directly related to that > domain. Many times, that means they will ignore as irrelevant > distinctions that are critical in certain contexts, simply because > they don't come up all that often outside those specific domains, and > addressing them involves making the core module more complicated to > use for more typical cases. > Good point. > > In this case, the proposed alternate permutations mechanism only makes > a difference when: > > 1. The data set contains equivalent values > 2. Input order is not considered significant, so exchanging equivalent > values should *not* create a new permutation (i.e. multiset > permutations rather than sequence permutations). > > If users aren't likely to encounter situations where that makes a > difference, then providing both in the standard library isn't being > helpful, it's being actively user hostile by asking them to make a > decision they're not yet qualified to make for the sake of the few > experts that specifically need . Hence Raymond's request for data > modelling problems outside the "learning or studying combinatorics" > context to make the case for standard library inclusion. > > Interestingly, I just found another language which has the equivalent > of the currrent behaviour of itertools.permutations: Haskell has it as > Data.List.permutations. As far as I can tell, Haskell doesn't offer > support for multiset permutations in the core, you need an additional > package like Math.Combinatorics (see: > > http://hackage.haskell.org/package/multiset-comb-0.2.3/docs/Math-Combinatorics-Multiset.html#g:4 > ). > > Since iterator based programming in Python is heavily inspired by > Haskell, this suggests that the current behaviour of > itertools.permutations is appropriate and that Raymond is right to be > dubious about including multiset permutations support directly in the > standard library. > > You've convinced me that itertools permutations is doing the right thing :) I'm not sure if multiset permutations should be in the standard library or not. It is very useful. > Those interested in improving the experience of writing combinatorics > code in Python may wish to look into helping out with the > combinatorics package on PyPI: > http://phillipmfeldman.org/Python/for_developers.html (For example, > politely approach Phillip to see if he is interested in hosting it on > GitHub or BitBucket, providing Sphinx docs on ReadTheDocs, improving > the PyPI metadata, etc - note I have no experience with this package, > it's just the first hit for "python combinatorics") > > > One suggestion: Why not make it so that itertools.permutations checks if > its > > argument is an instance of collections.Mapping? 
If it is, we could > > interpret the items as a mapping from elements to positive integers, > which > > is a compact representation of a multiset. Then, it could do the right > > thing for that case. > > If you want to go down the path of only caring about hashable values, > you may want to argue for a permutations method on collections.Counter > (it's conceivable that approach has the potential to be even faster > than an approach based on accepting and processing an arbitrary > iterable, since it can avoid generating repeated values in the first > place). > > A Counter based multiset permutation algorithm was actually posted to > python-list back in 2009, just after collections.Counter was > introduced: > https://mail.python.org/pipermail/python-list/2009-January/521685.html > > Nice find! > I just created an updated version of that recipe and posted it as > https://bitbucket.org/ncoghlan/misc/src/default/multiset_permutations.py > > Why not just define multiset_permutations to accept a dict (dict is a base class of Counter)? Since you're going to convert from an iterable (with duplicates) to a dict (via Counter) anyway, why not accept it as such. Users who want an interface similar to itertools.permutations can pass their iterable through Counter first. Cheers, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tim.peters at gmail.com Sun Oct 13 23:22:06 2013 From: tim.peters at gmail.com (Tim Peters) Date: Sun, 13 Oct 2013 16:22:06 -0500 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: <525AF4E2.6010301@mrabarnett.plus.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> <5258B539.10307@mrabarnett.plus.com> <525AF4E2.6010301@mrabarnett.plus.com> Message-ID: [Tim] >> [MRAB, posts a beautiful solution] >> >> I don't really have a use for this, but it was a lovely programming >> puzzle, so I'll include an elaborate elaboration of MRAB's algorithm >> below. And that's the end of my interest in this ;-) >> >> It doesn't require that elements be orderable or even hashable. It >> does require that they can be compared for equality, but it's pretty >> clear that if we _do_ include something like this, "equality" has to >> be pluggable. >> ... [MRAB] > I posted yet another implementation after that one. I know. I was talking about the beautiful one ;-) The later one could build equivalence classes faster (than mine) in many cases, but I don't care much about the startup costs. I care a lot more about: 1. Avoiding searches in the recursive function; i.e., this: for i, item in enumerate(items): if item != prev_item: Making such tests millions (billions ...) of times adds up - and equality testing may not be cheap. The algorithm I posted does no item testing after the setup is done (none in its recursive function). 2. Making "equality" pluggable. Your later algorithm bought "find equivalence classes" speed for hashable elements by using a dict, but a dict's notion of equality can't be changed. So, make equality pluggable, and that startup-time speed advantage vanishes for all but operator.__eq__'s idea of equality. 
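To make the pluggable-equality point concrete, here is a minimal sketch using
the up() generator posted earlier in this thread (assumed to be in scope); the
case-insensitive lambda is just an illustrative choice:

>>> import operator
>>> list(up("aA", 2))    # default equality: operator.__eq__, so 'a' != 'A'
[('a', 'A'), ('A', 'a')]
>>> list(up("aA", 2, equal=lambda x, y: x.lower() == y.lower()))
[('a', 'A')]
>>> list(up([3, 3.0], 2, equal=operator.is_))    # identity, not value
[(3, 3.0), (3.0, 3)]

With the default value equality, 3 and 3.0 collapse into a single equivalence
class, so that last call would instead yield only (3, 3.0).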
From oscar.j.benjamin at gmail.com Mon Oct 14 01:10:51 2013 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Mon, 14 Oct 2013 00:10:51 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> <5258B539.10307@mrabarnett.plus.com> <525AF4E2.6010301@mrabarnett.plus.com> Message-ID: On 13 October 2013 22:22, Tim Peters wrote: > 2. Making "equality" pluggable. Your later algorithm bought "find > equivalence classes" speed for hashable elements by using a dict, but > a dict's notion of equality can't be changed. So, make equality > pluggable, and that startup-time speed advantage vanishes for all but > operator.__eq__'s idea of equality. It sounds like you want Antoine's TransformDict: http://www.python.org/dev/peps/pep-0455/ Oscar From oscar.j.benjamin at gmail.com Mon Oct 14 01:32:34 2013 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Mon, 14 Oct 2013 00:32:34 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> <5258B539.10307@mrabarnett.plus.com> <525AF4E2.6010301@mrabarnett.plus.com> Message-ID: On 14 October 2013 00:23, Tim Peters wrote: > [Tim] >>> 2. Making "equality" pluggable. Your later algorithm bought "find >>> equivalence classes" speed for hashable elements by using a dict, but >>> a dict's notion of equality can't be changed. So, make equality >>> pluggable, and that startup-time speed advantage vanishes for all but >>> operator.__eq__'s idea of equality. > > [Oscar Benjamin] >> It sounds like you want Antoine's TransformDict: >> http://www.python.org/dev/peps/pep-0455/ > > Not really in this case - I want a two-argument function ("are A and B > equal?"). Not all plausible cases of that can be mapped to a > canonical hashable key. For example, consider permutations of a list > of lists, where the user doesn't want int and float elements of the > lists to be considered equal when they happen to have the same value. > Is that a stretch? Oh ya ;-) Will this do? d = TransformDict(lambda x: (type(x), x)) Oscar From tim.peters at gmail.com Mon Oct 14 01:44:14 2013 From: tim.peters at gmail.com (Tim Peters) Date: Sun, 13 Oct 2013 18:44:14 -0500 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> <5258B539.10307@mrabarnett.plus.com> <525AF4E2.6010301@mrabarnett.plus.com> Message-ID: [Oscar Benjamin] > Will this do? > d = TransformDict(lambda x: (type(x), x)) No. In the example I gave, *lists* will be passed as x (it was a list of lists: the lists are the elements of the permutations, and they happen to have internal structure of their own). So the `type(x)` there is useless (it will always be the list type), while the lists themselves would still be compared by operator.__eq__. Not to mention that the constructed tuple isn't hashable anyway (x is a list), so can't be used by TransformDict. 
So that idea doesn't work several times over ;-) From oscar.j.benjamin at gmail.com Mon Oct 14 01:55:29 2013 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Mon, 14 Oct 2013 00:55:29 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> <5258B539.10307@mrabarnett.plus.com> <525AF4E2.6010301@mrabarnett.plus.com> Message-ID: On 14 October 2013 00:44, Tim Peters wrote: > [Oscar Benjamin] >> Will this do? >> d = TransformDict(lambda x: (type(x), x)) > > No. In the example I gave, *lists* will be passed as x (it was a list > of lists: the lists are the elements of the permutations, and they > happen to have internal structure of their own). So the `type(x)` > there is useless (it will always be the list type), while the lists > themselves would still be compared by operator.__eq__. > > Not to mention that the constructed tuple isn't hashable anyway (x is > a list), so can't be used by TransformDict. > > So that idea doesn't work several times over ;-) Damn, you're right. I obviously didn't think that one through hard enough. Okay how about this? d = TransformDict(lambda x: (tuple(map(type, x)), tuple(x))) Oscar From tim.peters at gmail.com Mon Oct 14 01:23:51 2013 From: tim.peters at gmail.com (Tim Peters) Date: Sun, 13 Oct 2013 18:23:51 -0500 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> <5258B539.10307@mrabarnett.plus.com> <525AF4E2.6010301@mrabarnett.plus.com> Message-ID: [Tim] >> 2. Making "equality" pluggable. Your later algorithm bought "find >> equivalence classes" speed for hashable elements by using a dict, but >> a dict's notion of equality can't be changed. So, make equality >> pluggable, and that startup-time speed advantage vanishes for all but >> operator.__eq__'s idea of equality. [Oscar Benjamin] > It sounds like you want Antoine's TransformDict: > http://www.python.org/dev/peps/pep-0455/ Not really in this case - I want a two-argument function ("are A and B equal?"). Not all plausible cases of that can be mapped to a canonical hashable key. For example, consider permutations of a list of lists, where the user doesn't want int and float elements of the lists to be considered equal when they happen to have the same value. Is that a stretch? Oh ya ;-) From oscar.j.benjamin at gmail.com Mon Oct 14 02:20:19 2013 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Mon, 14 Oct 2013 01:20:19 +0100 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> <5258B539.10307@mrabarnett.plus.com> <525AF4E2.6010301@mrabarnett.plus.com> Message-ID: On 14 October 2013 01:15, Tim Peters wrote: > [Oscar Benjamin] >> ... >> Damn, you're right. I obviously didn't think that one through hard >> enough. Okay how about this? >> d = TransformDict(lambda x: (tuple(map(type, x)), tuple(x))) > > Oscar, please give this up - it's not going to work. 
`x` can be any > object whatsoever, with arbitrarily complex internal structure, and > the user can have an arbitrarily convoluted idea of what "equal" > means. Did I mention that these lists don't *only* have ints and > floats as elements, but also nested sublists? Oh ya - they also want > a float and a singleton list containing the same float to be > considered equal ;-) Etc. That does seem contrived but then I guess the whole problem is however.... > Besides, you're trying to solve a problem I didn't have to begin with > ;-) That is, I don't care much about the cost of building equivalence > classes - it's a startup cost for the generator, not an "inner loop" > cost. Even if you could bash every case into a different convoluted > hashable tuple, in general it's going to be - in this specific problem > - far easier for the user to define an equal() function they like, > working directly on the two objects. That doesn't require an endless > sequence of tricks. okay I see what you mean. Oscar From tim.peters at gmail.com Mon Oct 14 02:15:00 2013 From: tim.peters at gmail.com (Tim Peters) Date: Sun, 13 Oct 2013 19:15:00 -0500 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <52587976.1000901@mrabarnett.plus.com> <5258AC0B.1090603@mrabarnett.plus.com> <5258B539.10307@mrabarnett.plus.com> <525AF4E2.6010301@mrabarnett.plus.com> Message-ID: [Oscar Benjamin] > ... > Damn, you're right. I obviously didn't think that one through hard > enough. Okay how about this? > d = TransformDict(lambda x: (tuple(map(type, x)), tuple(x))) Oscar, please give this up - it's not going to work. `x` can be any object whatsoever, with arbitrarily complex internal structure, and the user can have an arbitrarily convoluted idea of what "equal" means. Did I mention that these lists don't *only* have ints and floats as elements, but also nested sublists? Oh ya - they also want a float and a singleton list containing the same float to be considered equal ;-) Etc. Besides, you're trying to solve a problem I didn't have to begin with ;-) That is, I don't care much about the cost of building equivalence classes - it's a startup cost for the generator, not an "inner loop" cost. Even if you could bash every case into a different convoluted hashable tuple, in general it's going to be - in this specific problem - far easier for the user to define an equal() function they like, working directly on the two objects. That doesn't require an endless sequence of tricks. From felix at groebert.org Mon Oct 14 14:25:53 2013 From: felix at groebert.org (=?ISO-8859-1?Q?Felix_Gr=F6bert?=) Date: Mon, 14 Oct 2013 14:25:53 +0200 Subject: [Python-ideas] pytaint: taint tracking in python Message-ID: Hi, I'd like to start a discussion on adding a security feature: taint tracking. As part of his internship, Marcin (cc) has been working on a patch to cpython-2.7.5 which is available online. We also published a design document and slides. https://github.com/felixgr/pytaint The idea behind taint tracking (or taint checking) is that we mark ('taint') untrusted data and prevent the programmer from using it in sensitive places (called sinks). A standard use case would be in a web application, where data extracted from HTTP requests is tainted and a database connection is sensitive sink. 
In other words: objects returned by http request have a property indicating taint, and when one of them is passed to database connection, a TaintException is raised. The idea itself is not new (Ruby and Perl have it; there are also some python libraries floating around) and pretty much noone uses it - however with a few improvements, it can be made viable. Firstly, we introduce different kinds of taint (motivation: a string may be attack vector for many classes of attacks - e.g. XSS, SQLi - and we need different escaping for that). Secondly, we allow to easily apply it to existing software - a programmer can simply write a config file specifying taint sources, sensitive sinks and taint cleaners, and enable tracking by adding one line to his app. We think it's a very useful feature for developing most of webapps and other security-sensitive application in Python, any thoughts on this? Thanks, Felix -------------- next part -------------- An HTML attachment was scrubbed... URL: From dickinsm at gmail.com Mon Oct 14 14:29:04 2013 From: dickinsm at gmail.com (Mark Dickinson) Date: Mon, 14 Oct 2013 13:29:04 +0100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: On Sun, Oct 13, 2013 at 9:56 PM, Neil Girdhar wrote: > > By complete I meant that just as if you were to add the "error function, > erf" to math, you would want to add an equivalent version to cmath. > An interesting choice of example. *Why* would you want to do so? Since you bring this up, I assume you're already aware that math.erf exists but cmath.erf does not. I believe there are good, practical reasons *not* to add cmath.erf, in spite of the existence of math.erf. Not least of these is that cmath.erf would be significantly more complicated to implement and of significantly less interest to users. And perhaps there's a parallel with itertools.permutations and the proposed itertools.multiset_permutations here... -- Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Mon Oct 14 14:37:59 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Mon, 14 Oct 2013 08:37:59 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: Actually I didn't notice that. It seems weird to find erf in math, but erf for complex numbers in scipy.special. It's just about organization and user discovery. I realize that from the developer's point of view, erf for complex numbers is complicated, but why does the user care? On Mon, Oct 14, 2013 at 8:29 AM, Mark Dickinson wrote: > On Sun, Oct 13, 2013 at 9:56 PM, Neil Girdhar wrote: > >> >> By complete I meant that just as if you were to add the "error function, >> erf" to math, you would want to add an equivalent version to cmath. >> > > An interesting choice of example. *Why* would you want to do so? > > Since you bring this up, I assume you're already aware that math.erf > exists but cmath.erf does not. I believe there are good, practical reasons > *not* to add cmath.erf, in spite of the existence of math.erf. Not least > of these is that cmath.erf would be significantly more complicated to > implement and of significantly less interest to users. 
And perhaps there's > a parallel with itertools.permutations and the proposed > itertools.multiset_permutations here... > > -- > Mark > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oscar.j.benjamin at gmail.com Mon Oct 14 15:11:42 2013 From: oscar.j.benjamin at gmail.com (Oscar Benjamin) Date: Mon, 14 Oct 2013 14:11:42 +0100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: On 14 October 2013 13:37, Neil Girdhar wrote: > > Actually I didn't notice that. It seems weird to find erf in math, but erf > for complex numbers in scipy.special. It's just about organization and user > discovery. I realize that from the developer's point of view, erf for > complex numbers is complicated, but why does the user care? This is the first time I've seen a suggestion that there should be cmath.erf. So I would say that most users don't care about having a complex error function. Whoever would take the time to implement the complex error function might instead spend that time implementing and maintaining something that users do care about. Oscar From ncoghlan at gmail.com Mon Oct 14 15:15:06 2013 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 14 Oct 2013 23:15:06 +1000 Subject: [Python-ideas] pytaint: taint tracking in python In-Reply-To: References: Message-ID: On 14 October 2013 22:25, Felix Gr?bert wrote: > We think it's a very useful feature for developing most of webapps and other > security-sensitive application in Python, any thoughts on this? It's definitely an interesting idea, and the idea of pursuing it initially as a separate project to optionally harden Python 2 applications is a good one. Longer term, before it can be considered for inclusion as a language feature: 1. It needs to work with Python 3 (which has a substantially different text model), as Python 2 is no longer receiving new features. 2. The performance impact needs to be assessed when the feature is disabled (the default) and when various sources and sinks are defined. The performance numbers comparing http://hg.python.org/benchmarks/ between vanilla CPython 2.7.5 and pytaint may also be of interest to potential users of the Python 2.7 version. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From mistersheik at gmail.com Mon Oct 14 15:15:06 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Mon, 14 Oct 2013 09:15:06 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: Look I don't want it, and anyway it's already in scipy.special. I just organizational symmetry. I expected to find complex versions of math functions in cmath ?not in scipy special. On Mon, Oct 14, 2013 at 9:11 AM, Oscar Benjamin wrote: > On 14 October 2013 13:37, Neil Girdhar wrote: > > > > Actually I didn't notice that. It seems weird to find erf in math, but > erf > > for complex numbers in scipy.special. It's just about organization and > user > > discovery. I realize that from the developer's point of view, erf for > > complex numbers is complicated, but why does the user care? > > This is the first time I've seen a suggestion that there should be > cmath.erf. 
So I would say that most users don't care about having a > complex error function. Whoever would take the time to implement the > complex error function might instead spend that time implementing and > maintaining something that users do care about. > > > Oscar > -------------- next part -------------- An HTML attachment was scrubbed... URL: From breamoreboy at yahoo.co.uk Mon Oct 14 15:26:59 2013 From: breamoreboy at yahoo.co.uk (Mark Lawrence) Date: Mon, 14 Oct 2013 14:26:59 +0100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: On 14/10/2013 14:15, Neil Girdhar wrote: > Look I don't want it, and anyway it's already in scipy.special. I just > organizational symmetry. I expected to find complex versions of math > functions in cmath ?not in scipy special. > Why are you comparing core Python modules with third party ones? -- Roses are red, Violets are blue, Most poems rhyme, But this one doesn't. Mark Lawrence From abarnert at yahoo.com Mon Oct 14 18:07:26 2013 From: abarnert at yahoo.com (Andrew Barnert) Date: Mon, 14 Oct 2013 09:07:26 -0700 Subject: [Python-ideas] pytaint: taint tracking in python In-Reply-To: References: Message-ID: On Oct 14, 2013, at 5:25, Felix Gr?bert wrote: > The idea itself is not new (Ruby and Perl have it; there are also some python libraries floating around) and pretty much noone uses it - however with a few improvements, it can be made viable. A good part of the reason no one uses it is that SQL injection is always given as the motivation for the idea, but it's not a very good solution for that problem, and there's already a well-known better solution: parameterized queries. SQL isn't the only case where you build executable strings--a document formatter might build Postscript code; a forum might build HTML (maybe even with embedded JS); a game might even read Python code from an in-game console or untrusted mod that's allowed to run in a different globals environment but not the main one; etc. Has anyone successfully used perl's long-standing taint mode for any such purposes? If not, can you demonstrate using it in python? I don't think that would be _necessary_ for a python taint mode implementation to be considered useful, but it would certainly help get attention to the idea. From raymond.hettinger at gmail.com Mon Oct 14 19:56:23 2013 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Mon, 14 Oct 2013 10:56:23 -0700 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: <501BBA96-9DEF-4417-A18A-70FC65729329@gmail.com> On Oct 13, 2013, at 1:56 PM, Neil Girdhar wrote: > I'm now convinced that itertools.permutations is fine as it is. I am not totally convinced that multiset_permutations doesn't belong in itertools, Now that we have a good algorithm, I'm open to adding this to itertools, but it would need to have a name that didn't create any confusion with respect to the existing tools, perhaps something like: anagrams(population, r) Return an iterator over a all distinct r-length permutations of the population. Unlike permutations(), element uniqueness is determined by value rather than by position. Also, anagrams() makes no guarantees about the order the tuples are generated. 
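A rough sketch of the intended behaviour - hypothetical, since anagrams() is
only a proposal at this point - phrased in terms of the unique_permutations()
code posted earlier in the thread (assumed to be in scope):

>>> from itertools import permutations
>>> list(permutations("aba", 2))          # uniqueness by position
[('a', 'b'), ('a', 'a'), ('b', 'a'), ('b', 'a'), ('a', 'a'), ('a', 'b')]
>>> list(unique_permutations("aba", 2))   # uniqueness by value
[('a', 'b'), ('a', 'a'), ('b', 'a')]

anagrams("aba", 2) would be expected to produce those same three distinct
tuples, in some unspecified order.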
Raymond -------------- next part -------------- An HTML attachment was scrubbed... URL: From bruce at leapyear.org Mon Oct 14 20:03:21 2013 From: bruce at leapyear.org (Bruce Leban) Date: Mon, 14 Oct 2013 11:03:21 -0700 Subject: [Python-ideas] pytaint: taint tracking in python In-Reply-To: References: Message-ID: There's another good use case for tainting: html injection (XSS). There's a good solution for that too but XSS is still prevalent because it's easy to build html by concatenating strings without escaping and template systems make it too easy to inject strings without escaping (or put another way, they make it equally easy to inject escaped strings as unescaped strings). However, the issue is not just tainting but typing as well. When I have a string, I need to know if it's raw text or html text. If it's html text, I need to know if it's safe (generated by the program or user input that's been sanitized (carefully)) or unsafe (raw user input). I'm not sure it isn't --- Bruce I'm hiring: http://www.cadencemd.com/info/jobs Latest blog post: Alice's Puzzle Page http://www.vroospeak.com Learn how hackers think: http://j.mp/gruyere-security On Mon, Oct 14, 2013 at 9:07 AM, Andrew Barnert wrote: > On Oct 14, 2013, at 5:25, Felix Gr?bert wrote: > > > The idea itself is not new (Ruby and Perl have it; there are also some > python libraries floating around) and pretty much noone uses it - however > with a few improvements, it can be made viable. > > A good part of the reason no one uses it is that SQL injection is always > given as the motivation for the idea, but it's not a very good solution for > that problem, and there's already a well-known better solution: > parameterized queries. > > SQL isn't the only case where you build executable strings--a document > formatter might build Postscript code; a forum might build HTML (maybe even > with embedded JS); a game might even read Python code from an in-game > console or untrusted mod that's allowed to run in a different globals > environment but not the main one; etc. Has anyone successfully used perl's > long-standing taint mode for any such purposes? If not, can you demonstrate > using it in python? > > I don't think that would be _necessary_ for a python taint mode > implementation to be considered useful, but it would certainly help get > attention to the idea. > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mistersheik at gmail.com Mon Oct 14 22:28:44 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Mon, 14 Oct 2013 16:28:44 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <501BBA96-9DEF-4417-A18A-70FC65729329@gmail.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> <501BBA96-9DEF-4417-A18A-70FC65729329@gmail.com> Message-ID: Excellent! My top two names are 1. multiset_permutations (reflects the mathematical name) 2. anagrams Note that we may also want to add multiset_combinations. It hasn't been part of this discussion, but it may be part of another discussion and I wanted to point this out as I know many of you are future-conscious. We seem to be all agreed that we want to accept "r", the length of the permutation desired. 
With permutations, the *set* is passed in as a iterable representing distinct elements. With multiset_permutations, there are three ways to pass in the *multiset*: - 1. an iterable whose elements (or an optional key function applied to which) are compared using __eq__ - 2. a dict (of which collections.Counter) is a subclass - 3. an iterable whose elements are key-value pairs and whose values are counts Example uses: 1. multiset_permutations(word) 2. multiset_permutations(Counter(word)) 3. multiset_permutations(Counter(word).items()) >From a dictionary: 1. multiset_permutations(itertools.chain.from_iterable(itertools.repeat(k, v) for k, v in d.items())) 2. multiset_permutations(d) 3. multiset_permutations(d.items()) >From an iterable of key-value pairs: 1. multiset_permutations(itertools.chain.from_iterable(itertools.repeat(k, v) for k, v in it)) 2. multiset_permutations({k: v for k, v in it}) 3. multiset_permutations(it) The advantage of 2 is that no elements are compared by multiset_permutations (so it is simpler and faster). The advantage of 3 is that no elements are compared, and they need not be comparable or hashable. This version is truly a generalization of the "permutations" function. This way, for any input "it" you could pass to permutations, you could equivalently pass zip(it, itertools.repeat(1)) to multiset_permutations. Comments? Neil On Mon, Oct 14, 2013 at 1:56 PM, Raymond Hettinger < raymond.hettinger at gmail.com> wrote: > > On Oct 13, 2013, at 1:56 PM, Neil Girdhar wrote: > > I'm now convinced that itertools.permutations is fine as it is. I am not > totally convinced that multiset_permutations doesn't belong in itertools, > > > Now that we have a good algorithm, I'm open to adding this to itertools, > but it would need to have a name that didn't create any confusion > with respect to the existing tools, perhaps something like: > > anagrams(population, r) > > Return an iterator over a all distinct r-length permutations > of the population. > > Unlike permutations(), element uniqueness is determined > by value rather than by position. Also, anagrams() makes > no guarantees about the order the tuples are generated. > > > > Raymond > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bruce at leapyear.org Mon Oct 14 22:52:40 2013 From: bruce at leapyear.org (Bruce Leban) Date: Mon, 14 Oct 2013 13:52:40 -0700 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: <525AFCC5.4070309@mrabarnett.plus.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <525AFCC5.4070309@mrabarnett.plus.com> Message-ID: On Sun, Oct 13, 2013 at 1:04 PM, MRAB wrote: > Here's a use case: anagrams. > For what it's worth, I've written anagram-finding code, and I didn't do it with permutations. The faster approach is to create a dictionary mapping a canonical form of each word to a list of words, e.g., { 'ACT': ['ACT', 'CAT'], 'AET': ['ATE', 'EAT', 'ETA', 'TEA'] } This requires extra work to build the map but you do that just once when you read the dictionary and then every lookup is O(1) not O(len(word)). This canonical form approach is useful for other word transformations that are used in puzzles, e.g., words that are have the same consonants (ignoring vowels). --- Bruce I'm hiring: http://www.cadencemd.com/info/jobs Latest blog post: Alice's Puzzle Page http://www.vroospeak.com Learn how hackers think: http://j.mp/gruyere-security P.S. 
Yes, I know: if you play Scrabble, TAE is also a word. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tjreedy at udel.edu Tue Oct 15 00:59:54 2013 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 14 Oct 2013 18:59:54 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> <501BBA96-9DEF-4417-A18A-70FC65729329@gmail.com> Message-ID: On 10/14/2013 4:28 PM, Neil Girdhar wrote: > Excellent! > > My top two names are > 1. multiset_permutations (reflects the mathematical name) > 2. anagrams I like anagrams. I did not completely get what this issue was about until someone finally mentioned anagrams as use case. -- Terry Jan Reedy From tim.peters at gmail.com Tue Oct 15 02:48:17 2013 From: tim.peters at gmail.com (Tim Peters) Date: Mon, 14 Oct 2013 19:48:17 -0500 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <501BBA96-9DEF-4417-A18A-70FC65729329@gmail.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> <501BBA96-9DEF-4417-A18A-70FC65729329@gmail.com> Message-ID: [Raymond Hettinger] > Now that we have a good algorithm, I'm open to adding this to itertools, I remain reluctant, because I still haven't seen a compelling use case. Yes, it generates all distinct r-letter anagrams - but so what? LOL ;-) Seriously, I've written anagram programs several times in my life, and generating "all possible" never occurred to me because it's so crushingly inefficient. > but it would need to have a name that didn't create any confusion > with respect to the existing tools, perhaps something like: > > anagrams(population, r) "anagrams" is great! Inspired :-) What about an optional argument to define what the _user_ means by "equality"? The algorithm I posted had an optional `equal=operator.__eq__` argument. Else you're going to be pushed to add a clumsy `TransformAnagrams` later <0.4 wink>. > Return an iterator over a all distinct r-length permutations > of the population. > > Unlike permutations(), element uniqueness is determined > by value rather than by position. Also, anagrams() makes > no guarantees about the order the tuples are generated. Well, MRAB's algorithm (and my rewrite) guarantees that _if_ the elements support a total order, and appear in the iterable in non-decreasing order, then the anagrams are generated in non-decreasing lexicographic order. And that may be a useful guarantee (hard to tell without a real use case, though!). There's another ambiguity I haven't seen addressed explicitly. Consider this: >>> from fractions import Fraction >>> for a in anagrams([3, 3.0, Fraction(3)], 3): ... print(a) (3, 3.0, Fraction(3, 1)) All the algorithms posted here work to show all 3 elements in this case. But why? If the elements all equal, then other outputs "should be" acceptable too. Like (3, 3, 3) or (3.0, Fraction(3, 1), 3.0) etc. All those outputs compare equal! This isn't visible if, e.g., the iterable's elements are letters (where a == b if and only if str(a) == str(b), so the output looks the same no matter what). 
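Spelling out why those three objects land in a single equivalence class (a
quick check, with the up() generator posted earlier in the thread assumed to
be in scope):

>>> from fractions import Fraction
>>> 3 == 3.0 == Fraction(3)
True
>>> len({3, 3.0, Fraction(3)})    # they hash alike too
1
>>> list(up([3, 3.0, Fraction(3)], 3))
[(3, 3.0, Fraction(3, 1))]

One class, one output tuple - the algorithm just happens to echo back the
distinct objects it was given rather than repeating a single canonical
representative.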
At least "my" algorithm could be simplified materially if it only saved (and iterated over) a (single) canonical representative for each equivalence class, instead of saving entire equivalence classes and then jumping through hoops to cycle through each equivalence class's elements. But, for some reason, output (3, 3, 3) just "looks wrong" above. I'm not sure why. From mistersheik at gmail.com Tue Oct 15 03:17:26 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Mon, 14 Oct 2013 21:17:26 -0400 Subject: [Python-ideas] Fwd: Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <01670D03-157D-49A5-A611-420B05F67DD8@yahoo.com> <525AFCC5.4070309@mrabarnett.plus.com> Message-ID: Here are a couple people looking for the function that doesn't exist (yet?) http://stackoverflow.com/questions/9660085/python-permutations-with-constraints/9660395#9660395 http://stackoverflow.com/questions/15592299/generating-unique-permutations-in-python On Mon, Oct 14, 2013 at 4:52 PM, Bruce Leban wrote: > > On Sun, Oct 13, 2013 at 1:04 PM, MRAB wrote: > >> Here's a use case: anagrams. >> > > For what it's worth, I've written anagram-finding code, and I didn't do it > with permutations. The faster approach is to create a dictionary mapping a > canonical form of each word to a list of words, e.g., > > { > 'ACT': ['ACT', 'CAT'], > 'AET': ['ATE', 'EAT', 'ETA', 'TEA'] > } > > This requires extra work to build the map but you do that just once when > you read the dictionary and then every lookup is O(1) not O(len(word)). > This canonical form approach is useful for other word transformations that > are used in puzzles, e.g., words that are have the same consonants > (ignoring vowels). > > > --- Bruce > I'm hiring: http://www.cadencemd.com/info/jobs > Latest blog post: Alice's Puzzle Page http://www.vroospeak.com > Learn how hackers think: http://j.mp/gruyere-security > > P.S. Yes, I know: if you play Scrabble, TAE is also a word. > > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- > > --- > You received this message because you are subscribed to a topic in the > Google Groups "python-ideas" group. > To unsubscribe from this topic, visit > https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. > To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe at googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From python at mrabarnett.plus.com Tue Oct 15 03:17:56 2013 From: python at mrabarnett.plus.com (MRAB) Date: Tue, 15 Oct 2013 02:17:56 +0100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> <501BBA96-9DEF-4417-A18A-70FC65729329@gmail.com> Message-ID: <525C97C4.2030601@mrabarnett.plus.com> On 15/10/2013 01:48, Tim Peters wrote: > [Raymond Hettinger] >> Now that we have a good algorithm, I'm open to adding this to itertools, > > I remain reluctant, because I still haven't seen a compelling use > case. Yes, it generates all distinct r-letter anagrams - but so what? 
> LOL ;-) Seriously, I've written anagram programs several times in my > life, and generating "all possible" never occurred to me because it's > so crushingly inefficient. > > >> but it would need to have a name that didn't create any confusion >> with respect to the existing tools, perhaps something like: >> >> anagrams(population, r) > > "anagrams" is great! Inspired :-) > > What about an optional argument to define what the _user_ means by > "equality"? The algorithm I posted had an optional > `equal=operator.__eq__` argument. Else you're going to be pushed to > add a clumsy `TransformAnagrams` later <0.4 wink>. > >> Return an iterator over a all distinct r-length permutations >> of the population. >> >> Unlike permutations(), element uniqueness is determined >> by value rather than by position. Also, anagrams() makes >> no guarantees about the order the tuples are generated. > > Well, MRAB's algorithm (and my rewrite) guarantees that _if_ the > elements support a total order, and appear in the iterable in > non-decreasing order, then the anagrams are generated in > non-decreasing lexicographic order. And that may be a useful > guarantee (hard to tell without a real use case, though!). > [snip] I can see that one disadvantage of my algorithm is that the worst-case storage requirement is O(n^2) (I think). This is because the set of first items could have N members, the set of second items could have N-1 members, etc. On the other hand, IMHO, the sheer number of permutations will become a problem long before the memory requirement does! :-) From steve at pearwood.info Tue Oct 15 03:27:18 2013 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 15 Oct 2013 12:27:18 +1100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <20131013014742.GR7989@ando> Message-ID: <20131015012718.GZ7989@ando> On Mon, Oct 14, 2013 at 08:37:59AM -0400, Neil Girdhar wrote: > Actually I didn't notice that. It seems weird to find erf in math, but erf > for complex numbers in scipy.special. It's just about organization and > user discovery. I realize that from the developer's point of view, erf for > complex numbers is complicated, but why does the user care? 99% of users don't care about math.errf at all. Of those who do, 99% don't care about cmath.errf. I'd like to see cmath.errf because I'm a maths junkie, but if I were responsible for *actually doing the work* I'd make the same decision to leave cmath.errf out and leave it for a larger, more complete library like scipy. There are an infinitely large number of potential programs which could in principle be added to Python's std lib, and only a finite number of person-hours to do the work. And there are costs to adding software to the std lib, not just benefits. -- Steven From mistersheik at gmail.com Tue Oct 15 03:29:24 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Mon, 14 Oct 2013 21:29:24 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <20131015012718.GZ7989@ando> References: <20131013014742.GR7989@ando> <20131015012718.GZ7989@ando> Message-ID: You make a good point. It was just a random example to illustrate that desire for completeness. On Mon, Oct 14, 2013 at 9:27 PM, Steven D'Aprano wrote: > On Mon, Oct 14, 2013 at 08:37:59AM -0400, Neil Girdhar wrote: > > Actually I didn't notice that. It seems weird to find erf in math, but > erf > > for complex numbers in scipy.special. It's just about organization and > > user discovery. 
I realize that from the developer's point of view, erf > for > > complex numbers is complicated, but why does the user care? > > 99% of users don't care about math.errf at all. Of those who do, 99% > don't care about cmath.errf. I'd like to see cmath.errf because I'm a > maths junkie, but if I were responsible for *actually doing the work* > I'd make the same decision to leave cmath.errf out and leave it for a > larger, more complete library like scipy. > > There are an infinitely large number of potential programs which could > in principle be added to Python's std lib, and only a finite number of > person-hours to do the work. And there are costs to adding software to > the std lib, not just benefits. > > > -- > Steven > _______________________________________________ > Python-ideas mailing list > Python-ideas at python.org > https://mail.python.org/mailman/listinfo/python-ideas > > -- > > --- > You received this message because you are subscribed to a topic in the > Google Groups "python-ideas" group. > To unsubscribe from this topic, visit > https://groups.google.com/d/topic/python-ideas/dDttJfkyu2k/unsubscribe. > To unsubscribe from this group and all its topics, send an email to > python-ideas+unsubscribe at googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Tue Oct 15 03:39:30 2013 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 15 Oct 2013 11:39:30 +1000 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <20131013014742.GR7989@ando> <20131015012718.GZ7989@ando> Message-ID: On 15 October 2013 11:29, Neil Girdhar wrote: > You make a good point. It was just a random example to illustrate that > desire for completeness. The desire for conceptual purity and consistency is a good one, it just needs to be balanced against the practical constraints of writing, maintaining, documenting, teaching and learning the standard library. "It isn't worth the hassle" is the answer to a whole lot of "Why not X?" questions in software development :) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From tim.peters at gmail.com Tue Oct 15 03:40:00 2013 From: tim.peters at gmail.com (Tim Peters) Date: Mon, 14 Oct 2013 20:40:00 -0500 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: <525C97C4.2030601@mrabarnett.plus.com> References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> <501BBA96-9DEF-4417-A18A-70FC65729329@gmail.com> <525C97C4.2030601@mrabarnett.plus.com> Message-ID: [MRAB] > I can see that one disadvantage of my algorithm is that the worst-case > storage requirement is O(n^2) (I think). This is because the set of > first items could have N members, the set of second items could have > N-1 members, etc. On the other hand, IMHO, the sheer number of > permutations will become a problem long before the memory requirement > does! :-) My rewrite is O(N) space (best and worst cases). I _think_ yours is too, but I understand my rewrite better by now ;-) Each element of the iterable appears in exactly one ENode: the `ehead` list is a partitioning of the input iterable. 
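As a quick sanity check of that partitioning claim (assuming the
build_equivalence_classes() posted earlier in this thread is in scope):

import operator

items = list("mississippi")
ehead = build_equivalence_classes(items, operator.__eq__)

# Walk the circular list of equivalence classes and count stored indices.
sizes = []
e = ehead.next
while e is not ehead:
    sizes.append(len(e.indices))
    e = e.next

print(sizes)                     # [1, 4, 4, 2] -- classes for 'm', 'i', 's', 'p'
print(sum(sizes) == len(items))  # True: every input index appears exactly once

Eleven letters, eleven index slots spread over four ENodes, no matter how
lopsided the classes are.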
From mistersheik at gmail.com Tue Oct 15 03:40:52 2013 From: mistersheik at gmail.com (Neil Girdhar) Date: Mon, 14 Oct 2013 21:40:52 -0400 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <20131013014742.GR7989@ando> <20131015012718.GZ7989@ando> Message-ID: On Mon, Oct 14, 2013 at 9:39 PM, Nick Coghlan wrote: > > Totally agree. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tim.peters at gmail.com Tue Oct 15 04:45:33 2013 From: tim.peters at gmail.com (Tim Peters) Date: Mon, 14 Oct 2013 21:45:33 -0500 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <9ae0d30b-1c32-4041-9282-19d00a9f8f9f@googlegroups.com> <20131012020647.GH7989@ando> <20131012063445.GI7989@ando> <20131013014742.GR7989@ando> Message-ID: One example of prior art: Maxima, which I use in its wxMaxima incarnation. """ Function: permutations(a) Returns a set of all distinct permutations of the members of the list or set a. Each permutation is a list, not a set. When a is a list, duplicate members of a are included in the permutations """ Examples from a Maxima shell: > permutations([1, 2. 3]); {[1,2,3],[1,3,2],[2,1,3],[2,3,1],[3,1,2],[3,2,1]} > permutations([[1, 2], [1, 2], [2, 3]]) {[[1,2],[1,2],[2,3]], [[1,2],[2,3],[1,2]], [[2,3],[1,2],[1,2]]} > permutations({1, 1.0, 1, 1.0}) {[1,1.0],[1.0,1]} That last one may be surprising at first, but note that it's the first example where I passed a _set_ (instead of a list). And: > {1, 1.0, 1, 1.0} {1,1.0} Best I can tell, Maxima has no builtin function akin to our permutations(it, r) when r < len(it). But Maxima has a huge number of builtin functions, and I often struggle to find ones I _want_ in its docs ;-) From breamoreboy at yahoo.co.uk Tue Oct 15 09:30:21 2013 From: breamoreboy at yahoo.co.uk (Mark Lawrence) Date: Tue, 15 Oct 2013 08:30:21 +0100 Subject: [Python-ideas] Extremely weird itertools.permutations In-Reply-To: References: <20131013014742.GR7989@ando> <20131015012718.GZ7989@ando> Message-ID: On 15/10/2013 02:39, Nick Coghlan wrote: > > The desire for conceptual purity and consistency is a good one, it > just needs to be balanced against the practical constraints of > writing, maintaining, documenting, teaching and learning the standard > library. > > "It isn't worth the hassle" is the answer to a whole lot of "Why not > X?" questions in software development :) > > Cheers, > Nick. > Would our volunteers be more inclined to take on the hassle if they got double time on Saturdays and triple time on Sundays? :) -- Roses are red, Violets are blue, Most poems rhyme, But this one doesn't. Mark Lawrence From felix at groebert.org Tue Oct 15 11:58:41 2013 From: felix at groebert.org (=?ISO-8859-1?Q?Felix_Gr=F6bert?=) Date: Tue, 15 Oct 2013 11:58:41 +0200 Subject: [Python-ideas] pytaint: taint tracking in python In-Reply-To: References: Message-ID: 1. Please correct me if I misunderstand the Python project, but if the idea is deemed 'good' by this list, a PEP can follow and the feature can be included in Python 3? It is not necessary to have a Python 3 implementation beforehand? The existing Python 2.7.5 pytaint implementation is intended to be run by users who need tainting in Python 2 but can also serve as a reference / benchmark / proof-of-concept implementation for this discussion. 2. I haven't had the time to publish benchmarks yet but I plan to. Also, of course, the cpython tests pass and we added additional taint tracking tests. 
We also ran the internal tests of our python codebase with the pytaint interpreter. This produced only negligible failures, mostly because some C extensions hadn't been recompiled to work with the redefined string objects.

Regarding taint tracking as a feature for python:

First of all, taint tracking is a general language feature and can be considered for additional applications besides security. When it comes to the security community, taint tracking is certainly controversial. Nevertheless, my pytaint announcement received 50 retweets and 30 favs from a part of the security community, if that counts for something ;)

As Andrew and Bruce mention, there are other solutions to XSS and SQLi: template systems and parameterized queries. Another library solution exists for shell injection: pipes.quote. However, all these solutions require the developer to pick the correct library and method. We have empirical indicators that this works, but maybe only in 70% of cases. The rest of the developers are introducing new vulnerabilities. Thus, an additional language-based feature can help to mitigate the remaining 30% of cases. A web app framework (or a python-developing company) can maintain and ship a pytaint configuration which will raise a TaintError exception in those 30% of cases and prevent the vulnerability from being exploited. This argument follows the principle of defense-in-depth: why have just one security feature (e.g. pipes.quote) if we can offer several security features to the developer? This has previously worked well for system security: ASLR, DEP, etc.

Regarding the relation to typing:

We are using Merits on purpose to be able to distinguish between different forms of string cleaning. Today, most HTML template systems don't even make a distinction between different escaping contexts. However, with a pytaint Merit configuration for raw HTML, URLs, HTML attribute contents, CSS attributes and JS strings, you would be able to make sure that your string is cleaned for the specific context you're using it in. This can be implemented for each template system individually, but it would be easier to just write a pytaint config. If you don't clean strings based on browser context, you will run into problems: a string is cleaned with HTML-entity encoding but used in a