[Python-Dev] Bad interaction of __index__ and sequence repeat

Sat Jul 29 16:06:53 CEST 2006

Armin Rigo wrote:
> Hi,
> 
> There is an oversight in the design of __index__() that only just
> surfaced :-(  It is responsible for the following behavior, on a 32-bit
> machine with >= 2GB of RAM:
> 
>     >>> s = 'x' * (2**100)       # works!
>     >>> len(s)
>     2147483647
> 
> This is because PySequence_Repeat(v, w) works by applying w.__index__ in
> order to call v->sq_repeat.  However, __index__ is defined to clip the
> result to fit in a Py_ssize_t.  This means that the above problem exists
> with all sequences, not just strings, given enough RAM to create such
> sequences with 2147483647 items.
> 
> For reference, in 2.4 we correctly get an OverflowError.
> 
> Argh!  What should be done about it?

I've now got a patch on SF that aims to fix this properly [1].

The gist of the patch:

1. Redesign the PyNumber_Index C API to serve the actual use cases in the 
interpreter core and the standard library.

   The PEP 357 abstract C API as written was bypassed by nearly all of the 
uses in the core and the standard library. The patch redesigns that API to 
reduce code duplication between the various parts of the code base that were 
previously calling nb_index directly.

   The principal change is to provide an "is_index" output variable that the 
various mp_subscript implementations can use to determine whether the 
passed-in object was an index, rather than having to repeat the type check 
everywhere. The rationale for doing it this way:
   a. you may want to try something else (e.g. the mp_subscript 
implementations in the standard library try indexing before checking to see if 
the object is a slice object)
   b. a different error message may be wanted (e.g. the normal indexing 
related TypeError doesn't make sense for sequence repetition)
   c. a separate checking function would lead to repeating the check on common 
code paths (e.g. if an mp_subscript implementation did the type check first, 
and then PyNumber_Check did it again to see whether or not to raise an error)

   The output variable solves the problem nicely: either pass in NULL to get 
the default behaviour of raising a sequence indexing TypeError, or pass in a 
pointer to a C int in order to be told whether or not the typecheck succeeded 
without an exception actually being set if it fails (if the typecheck passes, 
but the actual call fails, the exception state is set as normal).
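
   To make the protocol concrete, here is a rough Python model of it. This is 
only a sketch: the real API is in C and uses an int pointer, while the model 
returns a (value, is_index) pair instead. The names as_index and subscript 
are made up for illustration and are not part of the patch.

```python
import operator

def as_index(obj, strict=True):
    """Return (value, is_index) for obj; a model only, not the C API.

    strict=True plays the role of passing NULL: a non-index raises TypeError.
    strict=False plays the role of passing an int pointer: a non-index is
    reported through the flag and no exception is raised.
    """
    if not hasattr(type(obj), "__index__"):
        if strict:
            raise TypeError("sequence index must be an integer")
        return None, False
    # The type check passed; an error from the call itself still propagates.
    return operator.index(obj), True

def subscript(seq, key):
    """mp_subscript-style lookup: try an index first, then a slice."""
    value, is_index = as_index(key, strict=False)
    if is_index:
        return seq[value]
    if isinstance(key, slice):
        return seq[key]
    raise TypeError("indices must be integers or slices")
```

   With this model, subscript(seq, 1) takes the index path and 
subscript(seq, slice(0, 2)) falls through to the slice check, with the type 
check performed only once in either case.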

   Additionally, PyNumber_Index is redefined to raise an IndexError for values 
that cannot be represented as a Py_ssize_t. The choice of IndexError was made 
based on the dominant usage in the standard library (IndexError is the correct 
error to raise so that an mp_subscript implementation does the right thing). 
There are only a few places that need to override the IndexError to replace it 
with OverflowError (the length argument to slice.indices, sequence repetition, 
the mmap constructor), whereas all of the sequence objects (list, tuple, 
string, unicode, array), as well as PyObject_Get/Set/DelItem, need it to raise 
IndexError.
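
   As a sketch of what such an override looks like, modelled in Python rather 
than C: checked_index below stands in for the proposed PyNumber_Index 
behaviour and does the range check itself, since the operator.index you have 
installed may not raise for large values. Both function names are 
hypothetical.

```python
import operator
import sys

def checked_index(obj):
    """Model: IndexError when the value cannot fit in a Py_ssize_t."""
    value = operator.index(obj)
    # sys.maxsize is the largest Py_ssize_t; the smallest is -sys.maxsize - 1.
    if not -sys.maxsize - 1 <= value <= sys.maxsize:
        raise IndexError("cannot fit index into Py_ssize_t")
    return value

def repeat(seq, count):
    """Sequence repetition: an oversized count is an overflow, not a bad index."""
    try:
        n = checked_index(count)
    except IndexError:
        raise OverflowError("repeat count too large") from None
    return seq * n
```

   Here repeat("ab", 2) behaves as usual, while repeat("x", 2**100) reports 
OverflowError instead of silently truncating the count.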

   Raising IndexError also benefits sequences implemented in Python, which can 
simply do:

   def __getitem__(self, idx):
       if isinstance(idx, slice):
           return self._get_slice(idx)
       idx = operator.index(idx)  # Will raise IndexError on overflow
       return self._get_item(idx)  # _get_item: hypothetical, like _get_slice

   A second API function PyNumber_SliceIndex is added so that the clipping 
semantics are still available where needed and _PyEval_SliceIndex is modified 
to use the new public API. This is exposed to Python code as 
operator.sliceindex().
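
   The clipping semantics can be approximated in pure Python as follows. This 
is a sketch under the assumption that "clipping" means clamping to the 
Py_ssize_t range; clip_index is a hypothetical helper written for 
illustration, not the function added by the patch.

```python
import operator
import sys

def clip_index(obj):
    """Convert obj via __index__ and clamp the result to the Py_ssize_t range."""
    value = operator.index(obj)
    # sys.maxsize is the largest Py_ssize_t; the smallest is -sys.maxsize - 1.
    return max(-sys.maxsize - 1, min(value, sys.maxsize))
```

   This is the behaviour slicing relies on: a bound such as 2**100 is clamped 
rather than rejected, so "abc"[:clip_index(2**100)] still yields "abc".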

   With the redesigned C API, the *only* code that calls the nb_index slot 
directly is in the two functions in abstract.c. Everything else goes through 
one or the other of those functions. Code duplication was significantly 
reduced as a result, and it should be much easier for third-party C libraries 
to do what they need to do (i.e. implement mp_subscript slots).

2. Redefine nb_index to return a PyObject *

   Returning the PyInt/PyLong objects directly from nb_index greatly 
simplified the implementation of the nb_index methods for the affected 
classes. For classic classes, instance_index could be modified to simply 
return the result of calling __index__, as could slot_nb_index for new-style 
classes. For the standard library classes, the existing int_int and long_long 
methods could be used instead of writing new functions.

   This convenience should hold true for extension classes - instead of 
needing to implement __index__ separately, they should be able to reuse their 
existing __int__ or __long__ implementations.

   The other benefit is that the logic to downconvert to Py_ssize_t that was 
formerly invoked by long's __index__ method is now instead invoked by 
PyNumber_Index and PyNumber_SliceIndex. This means that directly calling an 
__index__() method allows large long results to be passed through unaffected, 
but calling operator.index() will raise IndexError if the result does not fit 
in a Py_ssize_t:

   (2 ** 100).__index__() == (2**100)  # This works
   operator.index(2**100)              # This raises IndexError

The patch includes additions to test_index.py to cover these limit cases, as 
well as the necessary updates to the C API and operator module documentation.

Cheers,
Nick.

[1] http://www.python.org/sf/1530738

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org
