[Python-Dev] string.join and bad sequences

Guido van Rossum guido@beopen.com
Mon, 10 Jul 2000 15:14:50 -0500


> I added a test case to Lib/test/string_tests.py that uses a sequence
> that returns the wrong answer from __len__.  I've used this test in a
> number of places to make sure the interpreter doesn't dump core when
> it hits a bad user-defined sequence.
> 
> class Sequence:
>     def __init__(self): self.seq = 'wxyz'
>     def __len__(self): return len(self.seq)
>     def __getitem__(self, i): return self.seq[i]
> 
> class BadSeq2(Sequence):
>     def __init__(self): self.seq = ['a', 'b', 'c']
>     def __len__(self): return 8
> 
> The tests of string.join and " ".join don't dump core, but they do
> raise an IndexError.  I wonder if that's the right thing to do,
> because in the other places where this case is handled, no exception
> is raised.
> 
> The question boils down to the semantics of the sequence protocol.
> 
> The string code definition is:
>     if __len__ returns X, then the length is X
>     thus, __getitem__ should succeed for range(0, X)
>           if it doesn't, raise an IndexError
> 
> The other code (e.g. PySequence_Tuple) definition is:
>     if __len__ returns X, then the length is <= X
>     if __getitem__ succeeds for range(0, X), then the length is indeed X
>     if it does not, then the length is Y + 1,
>                     where Y is the highest index that actually works
> 
> The definition in PySequence_Tuple seemed quite clever when I first
> saw it, but I like it less now.  If a user-defined sequence raises
> IndexError when len indicates it should not, the code is broken.  The
> attempt to continue anyway is masking an error in user code.
> 
> I vote for fixing PySequence_Tuple and the like to raise an
> IndexError.

I'm not sure I agree.  When Steve Majewski proposed variable-length
sequences, we ended up conceding that __len__ is just a hint.  The
actual length can be longer or shorter.  The map and filter functions
allow this, and so do min/max and others that go over sequences, and
even (of course) the for loop.  (In fact, the preferred behavior is
not to call __len__ at all but just try x[0], x[1], x[2], ... until
IndexError is hit.)
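
A minimal sketch of that preferred behavior, written in Python rather
than in the interpreter's C code.  BadSeq2 is the class from the quoted
test case, flattened into one class so the block stands alone; the
helper name items_until_indexerror is made up for illustration and is
not an existing function:

```python
class BadSeq2:
    """Sequence from the quoted test case: __len__ overestimates."""
    def __init__(self): self.seq = ['a', 'b', 'c']
    def __len__(self): return 8
    def __getitem__(self, i): return self.seq[i]

def items_until_indexerror(x):
    """Collect x[0], x[1], x[2], ... without ever calling len(x)."""
    items = []
    i = 0
    while True:
        try:
            items.append(x[i])
        except IndexError:
            # The sequence itself says where it ends.
            break
        i += 1
    return items

result = items_until_indexerror(BadSeq2())
```

Here result holds the three real items even though len(BadSeq2())
claims 8, because the length is never consulted.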

If I read your description of PySequence_Tuple() correctly, it accepts
a __len__ that overestimates but not one that underestimates.  That's
wrong.
(In Majewski's example, a tar file wrapper would claim to have 0 items
but iterating over it in ascending order would access all the items in
the file.  Claiming some arbitrary integer as __len__ would be wrong.)

So string.join(BadSeq2(), "") or "".join(BadSeq2()) should return "abc".
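
The same rule covers the underestimating case.  The sketch below is
only an illustration under stated assumptions: LazySeq is a made-up
stand-in for something like the tar file wrapper (it claims 0 items),
and join_ignoring_len is a hypothetical helper, not the actual
string.join implementation:

```python
class LazySeq:
    """Hypothetical sequence whose __len__ underestimates (claims 0)."""
    def __init__(self, items): self.items = list(items)
    def __len__(self): return 0
    def __getitem__(self, i): return self.items[i]

def join_ignoring_len(sep, seq):
    """Join the items of seq, treating __len__ as a hint at most."""
    parts = []
    i = 0
    while True:
        try:
            parts.append(seq[i])
        except IndexError:
            # Stop at the first index the sequence rejects.
            break
        i += 1
    return sep.join(parts)

joined = join_ignoring_len("", LazySeq(['a', 'b', 'c']))
```

With this reading, a sequence that reports too few items is joined in
full, just like one that reports too many.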

--Guido van Rossum (home page: http://dinsdale.python.org/~guido/)