[Python-ideas] O(1) containment testing with Py3 range objects (was Re: Make len() usable on a generator)

Nick Coghlan ncoghlan at gmail.com
Sun Oct 5 05:35:46 CEST 2014


On 4 October 2014 22:39, Andrew Barnert <abarnert at yahoo.com> wrote:
> On Oct 4, 2014, at 13:48, Nick Coghlan <ncoghlan at gmail.com> wrote:
>
>> On 4 October 2014 02:45, Ian Cordasco <graffatcolmingov at gmail.com> wrote:
>>> On Fri, Oct 3, 2014 at 11:32 AM, Alexander Belopolsky
>>> <alexander.belopolsky at gmail.com> wrote:
>>>> You are correct about python-ideas, but range is not a generator in python 3
>>>> either:
>>>>
>>>> Python 3.4.0a0 (default:e9cecb612ff7+, Oct  3 2014, 12:07:42)
>>>>>>> type(range(10))  # not a generator
>>>> <class 'range'>
>>>>>>> type(x for x in range(10))  # a generator
>>>> <class 'generator'>
>>>
>>> Ah yeah. All too often I conflate lazy iterators with generators. Nevermind
>>
>> Python 3 range isn't a lazy iterator either - it's a full sequence
>> type (specifically, a calculated tuple:
>> https://docs.python.org/3/library/stdtypes.html#ranges). The only
>> immutable sequence operations it doesn't support are concatenation and
>> repetition (since concatenating or repeating a range can't be
>> represented using the simple "start + n*step" calculation that range
>> uses internally).
>>
>> We'd have gotten a *lot* more complaints if we just dropped xrange in
>> as a substitute for the list objects returned by Python 2's range
>> builtin :)
>
> For some reason, most of the 3.x early adopters understood this, but later switchers seem to often believe that this is exactly what Python 3 did.

I've seen a few transition guides that oversimplify the situation as
"Python 3 range is the same as Python 2 xrange". Those transition
guides are wrong (xrange has many limitations compared to the more
full-featured Python 3 range implementation), but we don't have the
capacity to chase down every occurrence and ask people to fix it.
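To make the gap concrete, here's a quick sketch (plain interactive
Python, nothing assumed beyond the stdlib) of sequence operations that
Py3 range supports and Py2 xrange never did, such as slicing, the
index() and count() methods, and content-based equality:

```python
seq = range(10, 100, 10)    # 10, 20, ..., 90

# Py2 xrange lacked slicing, index()/count(), and content-based
# equality; Py3 range is a full immutable sequence type.
assert seq[-1] == 90                    # negative indexing
assert seq[::2] == range(10, 100, 20)   # slicing returns another range
assert seq.index(40) == 3               # sequence methods work
assert seq.count(40) == 1
assert seq == range(10, 91, 10)         # equality compares contents
assert 50 in seq                        # containment check
```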

One nice trick with Py3 range, for example, is working with ranges
that Py2 could never hope to handle:

$ python3 -m timeit -s "seq = range(-(10**24), 10**24)" "0 in seq"
1000000 loops, best of 3: 0.295 usec per loop

$ python -m timeit -s "seq = range(-(10**24), 10**24)" "0 in seq"
Traceback (most recent call last):
  File "/usr/lib64/python2.7/timeit.py", line 300, in main
    x = t.timeit(number)
  File "/usr/lib64/python2.7/timeit.py", line 195, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 3, in inner
    seq = range(-(10**24), 10**24)
OverflowError: range() result has too many items

There's still an unfortunate technical limitation in Py3 where *len*
is restricted to 64 bit integers, but start/stop/step are exposed on
range objects these days, so you can do the calculation directly if
you really need to. (That workaround severely reduces the motivation
for anyone to put together a backwards compatible proposal for an
alternative C level length API that uses a Python object rather than a
C ssize_t value)
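As a sketch of that workaround: since start/stop/step are exposed as
attributes, you can compute the true length as a Python int even when
len() would overflow. (range_len here is a hypothetical helper, not
part of the stdlib.)

```python
def range_len(r):
    """Number of elements in a range, computed from start/stop/step,
    avoiding the C-level ssize_t limit that len() hits on huge ranges.
    (Hypothetical helper for illustration.)"""
    if r.step > 0:
        gap = r.stop - r.start
    else:
        gap = r.start - r.stop
    if gap <= 0:
        return 0
    return (gap + abs(r.step) - 1) // abs(r.step)  # ceiling division

huge = range(-(10**24), 10**24)
assert range_len(huge) == 2 * 10**24  # fine as an arbitrary-size int
# len(huge) raises OverflowError on CPython
```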

> For example, at least half a dozen times, I've written an answer on StackOverflow showing someone that `n in range(start, stop)` is what they were trying to write, and explaining that it means the same thing as `start <= n < stop` (with various caveats about Python 2, and `step`, etc.) only for someone to edit or comment on my answer claiming that range.__contains__ has to be linear and should never be used. I explain that it's a sequence, and can be indexed, and show them a quick timing test, and they still insist that's just an accidental optimization that shouldn't be relied on, because it's really just meant to be Python 2's xrange. Once someone insisted that if you want this functionality you should use a multi-interval class off PyPI, even though the docs for that class explicitly say that a single interval is just an empty subclass of range in Py 3. People who stick with 2.x just really want to believe you got this wrong, I guess.

Containment tests on range() are guaranteed to be O(1) when using
CPython to test against true int and bool objects. Other
implementations may not offer the same guarantee (although it's a
simple enough optimisation that I believe they should).

CPython itself will currently fall back to O(n) behaviour for int
subclasses and other objects to handle backwards compatibility issues
in relation to custom __eq__ methods (see
http://bugs.python.org/issue1766304 for history and details - note
that this *was* under consideration for 2.7 xrange as well, and merely
didn't make the release deadline).

For those that haven't seen the kind of timing examples Andrew
mentioned, one easy way to see the difference is by comparing integer
lookup performance with a numeric type that isn't an integer at all:

$ python3 -m timeit -s "seq = range(10**6)" "10e5 in seq"
10 loops, best of 3: 118 msec per loop
$ python3 -m timeit -s "seq = range(10**6)" "int(10e5) in seq"
1000000 loops, best of 3: 0.348 usec per loop

You cop a small constant-time hit for the float->int conversion, but
gain back 6 orders of magnitude (in this particular example) through
the switch to an O(1) containment check.

This optimisation is particularly useful in cases that can't easily be
converted to comparison operations, such as when the range has a step
magnitude other than 1.
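A sketch of why the stepped case is awkward to spell by hand: the
containment test has to check both the bounds and the stride
alignment. (in_range is a hypothetical reimplementation, shown only to
make the arithmetic explicit.)

```python
def in_range(n, r):
    """What `n in r` computes for a range r, written out by hand.
    (Hypothetical equivalent, for illustration only.)"""
    if r.step > 0:
        in_bounds = r.start <= n < r.stop
    else:
        in_bounds = r.stop < n <= r.start
    # The bounds check alone isn't enough: n must also sit on the stride.
    return in_bounds and (n - r.start) % r.step == 0

seq = range(3, 1000, 7)    # 3, 10, 17, ...
assert (17 in seq) == in_range(17, seq) == True
assert (18 in seq) == in_range(18, seq) == False
```

Getting that modulo arithmetic right in user code (especially with
negative steps) is exactly the kind of thing `n in seq` saves you from.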

I thought I had an open issue proposing to expand the O(1)
optimisation to any type that implements __index__ (as well as making
it a documented guarantee all implementations are expected to
provide), but it looks like I may have only thought about filing that
rather than actually doing it. (Or perhaps it's hiding in the
python-ideas archives somewhere)
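In the meantime, the workaround for such types is an explicit
operator.index() call before the containment test. (MyIndex below is a
made-up stand-in for an integer-like type such as a NumPy integer.)

```python
import operator

class MyIndex:
    """Hypothetical integer-like type that supports __index__."""
    def __init__(self, value):
        self.value = value
    def __index__(self):
        return self.value
    def __eq__(self, other):
        return self.value == other
    def __hash__(self):
        return hash(self.value)

seq = range(10**6)
x = MyIndex(999999)

assert x in seq                  # works, but takes the O(n) scan today
assert operator.index(x) in seq  # exact int: back on the O(1) path
```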

> I've seen similar things with, e.g., people not believing that dict.keys() returns a set that's a view into the dict's storage rather than some kind of magical iterator that knows how to be used multiple times. (I'm not even sure how that would work; how could __iter__ know when you were trying to get self and when you were trying to start over?)

That at least has an authoritative reference in PEP 3106 (along with
the standard library docs). Unfortunately, the more convinced someone
is that they already understand a change, the less likely they are to
read the references to the relevant design discussions and decisions
:P

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

