[Python-Dev] Bad interaction of __index__ and sequence repeat

Fri Jul 28 17:55:47 CEST 2006

[Armin Rigo]
> There is an oversight in the design of __index__() that only just
> surfaced :-(  It is responsible for the following behavior, on a 32-bit
> machine with >= 2GB of RAM:
>
>     >>> s = 'x' * (2**100)       # works!
>     >>> len(s)
>     2147483647
>
> This is because PySequence_Repeat(v, w) works by applying w.__index__ in
> order to call v->sq_repeat.

?  I don't see an invocation of __index__ or nb_index in
PySequence_Repeat.  To the contrary, its /incoming/ `count` argument
is constrained to Py_ssize_t from the start:

    PyObject * PySequence_Repeat(PyObject *o, Py_ssize_t count)

... OK, I think you mean sequence_repeat() in abstract.c.  That does
invoke nb_index.  But, as below, I don't think it should in this case.

> However, __index__ is defined to clip the result to fit in a Py_ssize_t.
> This means that the above problem exists with all sequences, not just
> strings, given enough RAM to create such sequences with 2147483647 items.
>
> For reference, in 2.4 we correctly get an OverflowError.
>
> Argh!  What should be done about it?

IMO, this is plain wrong.  PEP 357 isn't entirely clear, but it is
clear the author only had /slicing/ in mind (where clipping makes
sense -- and which makes `__index__` a misleading name).  Guido
pointed out the ambiguity here:

    http://mail.python.org/pipermail/python-dev/2006-February/060624.html

    There's also an ambiguity when using simple indexing. When writing
    x[i] where x is a sequence and i an object that isn't int or long but
    implements __index__, I think i.__index__() should be used rather than
    bailing out.  I suspect that you didn't think of this because you've
    already special-cased this in your code -- when a non-integer is
    passed, the mapping API is used (mp_subscript). This is done to
    support extended slicing. The built-in sequences (list, str, unicode,
    tuple for sure, probably more) that implement mp_subscript should
    probe for nb_index before giving up. The generic code in
    PyObject_GetItem should also check for nb_index before giving up.

So, e.g., plain a[i] shouldn't use __index__ either if i is already
int or long.  I don't see any justification for invoking nb_index in
sequence_repeat(), although if someone thinks it should, then as for
plain indexing it certainly shouldn't invoke nb_index if the incoming
count is an int or long to begin with.
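
To make the distinction concrete, here is a rough Python sketch of the two
conversions I have in mind.  The names SSIZE_MAX, as_slice_bound and
as_repeat_count are made up for illustration (the real code is C in
abstract.c), SSIZE_MAX assumes a 32-bit Py_ssize_t, and the sketch is
written for an interpreter with a single int type:

```python
import operator

# Assumption: a 32-bit Py_ssize_t, as on the machine in Armin's example.
SSIZE_MAX = 2**31 - 1
SSIZE_MIN = -SSIZE_MAX - 1


def _to_integer(obj):
    """Return obj as an integer, consulting __index__ only for non-ints."""
    if isinstance(obj, int):
        return obj  # already an integer: no need to go through nb_index
    return operator.index(obj)  # raises TypeError if there is no __index__


def as_slice_bound(obj):
    """Hypothetical helper: clip, which is harmless for slice bounds."""
    return max(SSIZE_MIN, min(SSIZE_MAX, _to_integer(obj)))


def as_repeat_count(obj):
    """Hypothetical helper: refuse to clip, as 2.4 did for sequence repeat."""
    value = _to_integer(obj)
    if not SSIZE_MIN <= value <= SSIZE_MAX:
        raise OverflowError("repeat count does not fit in Py_ssize_t")
    return value
```

With these, as_slice_bound(2**100) quietly gives SSIZE_MAX, while
as_repeat_count(2**100) raises OverflowError instead of building a
truncated sequence.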

Ah, fudge.  Contrary to Guido's advice above, I see that
PyObject_GetItem() /also/ unconditionally invokes nb_index (even when
the incoming key is already int or long).  It shouldn't do that either
(according to me).
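
For concreteness, the only case where plain indexing needs nb_index at all
is a key that isn't already an integer.  Offset below is a made-up class,
and the results assume an interpreter where __index__ is consulted for both
plain indexing and slicing:

```python
class Offset:
    """Made-up non-int type that can stand in for an integer position."""

    def __init__(self, n):
        self.n = n

    def __index__(self):
        # Called when an Offset is used where an integer index is required.
        return self.n


letters = ['a', 'b', 'c', 'd']

picked = letters[Offset(2)]   # plain indexing with a non-int key
tail = letters[Offset(1):]    # slicing with a non-int bound
plain = letters[2]            # already an int: nothing to convert
```

Here picked and plain name the same item; only the first needs the
__index__ call.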

OTOH, in the long discussion about PEP 357, I'm not sure anyone except
Travis was clear on whether nb_index was meant to apply only to
sequence /slicing/ or was meant to apply "everywhere an object gets
used in an index-like context".  Clipping makes sense only for the
former, but it looks like the implementation treats it more like the
latter.  This was probably exacerbated by:

    http://mail.python.org/pipermail/python-dev/2006-February/060663.html

    [Travis]
    There are other places in Python that check specifically for int objects
    and long integer objects and fail with anything else.  Perhaps all of
    these should also call the __index__ slot.

    [Guido]
    Right, absolutely.

This is a mess :-)
