[Python-Dev] string.join and bad sequences
Guido van Rossum
guido@beopen.com
Mon, 10 Jul 2000 15:14:50 -0500
> I added a test case to Lib/test/string_tests.py that uses a sequence
> that returns the wrong answer from __len__. I've used this test in a
> number of places to make sure the interpreter doesn't dump core when
> it hits a bad user-defined sequence.
>
> class Sequence:
>     def __init__(self): self.seq = 'wxyz'
>     def __len__(self): return len(self.seq)
>     def __getitem__(self, i): return self.seq[i]
>
> class BadSeq2(Sequence):
>     def __init__(self): self.seq = ['a', 'b', 'c']
>     def __len__(self): return 8
>
> The tests of string.join and " ".join don't dump core, but they do
> raise an IndexError. I wonder if that's the right thing to do,
> because in the other places where this case is handled no exception
> is raised.
>
> The question boils down to the semantics of the sequence protocol.
>
> The string code definition is:
>     if __len__ returns X, then the length is X
>     thus, __getitem__ should succeed for range(0, X)
>     if it doesn't, raise an IndexError
>
> The other code (e.g. PySequence_Tuple) definition is:
>     if __len__ returns X, then the length is <= X
>     if __getitem__ succeeds for range(0, X), then length is indeed X
>     if it does not, then length is Y + 1 for highest Y
>     where Y is greatest index that actually works
>
> The definition in PySequence_Tuple seemed quite clever when I first
> saw it, but I like it less now. If a user-defined sequence raises
> IndexError when len indicates it should not, the code is broken. The
> attempt to continue anyway is masking an error in user code.
>
> I vote for fixing PySequence_Tuple and the like to raise an
> IndexError.
I'm not sure I agree. When Steve Majewski proposed variable-length
sequences, we ended up conceding that __len__ is just a hint. The
actual length can be longer or shorter. The map and filter functions
allow this, and so do min/max and others that go over sequences, and
even (of course) the for loop. (In fact, the preferred behavior is
not to call __len__ at all but just try x[0], x[1], x[2], ... until
IndexError is hit.)
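
A minimal sketch of that preferred behavior, written in present-day
Python syntax. The helper name iter_by_index is hypothetical (it is not
an interpreter function), and BadSeq2 is restated here only so that the
example is self-contained:

```python
def iter_by_index(seq):
    """Yield seq[0], seq[1], seq[2], ... until IndexError is hit.

    __len__ is never consulted, so a wrong length cannot matter.
    """
    i = 0
    while True:
        try:
            item = seq[i]
        except IndexError:
            return  # running off the end is the normal termination
        yield item
        i += 1


class BadSeq2:
    """Sequence whose __len__ overestimates the real length."""

    def __init__(self):
        self.seq = ['a', 'b', 'c']

    def __len__(self):
        return 8  # deliberately wrong

    def __getitem__(self, i):
        return self.seq[i]


items = list(iter_by_index(BadSeq2()))
print(items)
```

Under these assumptions the loop stops at the first failing index, so
the claimed length of 8 is simply ignored.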
If I read your description of PySequence_Tuple() correctly, it accepts
a __len__ that overestimates but not one that underestimates. That's wrong.
(In Majewski's example, a tar file wrapper would claim to have 0 items
but iterating over it in ascending order would access all the items in
the file. Claiming some arbitrary integer as __len__ would be wrong.)
So string.join(BadSeq2(), "") or "".join(BadSeq2()) should return "abc".
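
As a rough check of the "__len__ is just a hint" reading, the sketch
below joins one sequence that overestimates its length and one that
underestimates it. It assumes a current Python 3 interpreter rather
than the 2000-era code under discussion, and the class name ZeroLenSeq
is hypothetical, standing in for the tar file wrapper case:

```python
class Sequence:
    def __init__(self):
        self.seq = 'wxyz'

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, i):
        return self.seq[i]


class BadSeq2(Sequence):
    """__len__ overestimates: claims 8 items, has 3."""

    def __init__(self):
        self.seq = ['a', 'b', 'c']

    def __len__(self):
        return 8


class ZeroLenSeq(Sequence):
    """Hypothetical: __len__ underestimates, claims 0 items, has 3."""

    def __init__(self):
        self.seq = ['a', 'b', 'c']

    def __len__(self):
        return 0


# Both joins iterate by index until IndexError; the length is only a hint.
over = "".join(BadSeq2())
under = "".join(ZeroLenSeq())

# The for loop behaves the same way.
looped = [item for item in ZeroLenSeq()]
print(over, under, looped)
```

If the length is treated purely as a hint, both joins give "abc".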
--Guido van Rossum (home page: http://dinsdale.python.org/~guido/)