Inconsistent behaviour of str.find/str.index when providing optional parameters

Thu Nov 22 03:44:21 EST 2012

On Wed, 21 Nov 2012 23:01:47 -0800, Giacomo Alzetta wrote:

> On Thursday, 22 November 2012 05:00:39 UTC+1, MRAB wrote:
>> On 2012-11-22 03:41, Terry Reedy wrote: It can't return 5 because 5
>> isn't an index in 'spam'.
>>
>> It can't return 4 because 4 is below the start index.
> 
> Uhm. Maybe you are right, because returning a greater value would cause
> an IndexError, but then, *why* is 4 returned???
> 
> >>> 'spam'.find('', 4)
> 4
> >>> 'spam'[4]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> IndexError: string index out of range
> 
> 4 is not a valid index either. I do not think the behaviour was
> completely intentional.

The behaviour is certainly an edge case, but I think it is correct.

(Correct or not, it has been the same going all the way back to Python 
1.5, before strings even had methods, so it almost certainly will not be 
changed. Changing the behaviour now would very likely break hundreds, 
maybe thousands, of Python programs that rely on the current behaviour.)

Consider your string as a sequence of boxes, with index positions 
labelled above the string:

0-1-2-3-4
|s|p|a|m|

The indexing model is that positions represent where you would cut 
*between* characters, not the character itself. Slices are the substring 
between cuts:

"spam"[1:3] => "pa"

while single indexes return the character to the right of the cut:

"spam"[1] => "p"

If there is no character to the right of the cut, indexing raises an 
error.
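
To make the cut model concrete, here is a small sketch (the variable names 
are only illustrative) showing slicing between cuts, indexing to the right 
of a cut, and the error you get at the last cut:

```python
s = "spam"

# Cuts are numbered 0..len(s); a slice is the text between two cuts.
between_cuts = s[1:3]        # "pa"

# A single index returns the character to the right of the cut.
right_of_cut = s[1]          # "p"

# Cut 4 exists (it marks the end of the string), but no character
# lies to its right, so indexing there raises IndexError.
try:
    s[4]
    past_end_raised = False
except IndexError:
    past_end_raised = True
```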

Now, consider "spam".find(substring, start). This should return the 
number of the first cut immediately to the left of the substring, 
beginning the search at cut #start.

"spam".find("pa", 1) => 1

because cut #1 is immediately to the left of "pa" at index 1.

By this logic, "spam".find("", 4) should return 4, because cut #4 is 
immediately to the left of the empty string. So Python's current 
behaviour is justified.
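
As a quick check of that reasoning, the sketch below (illustrative names 
only) asks for the empty string starting at every cut in turn. Each search 
should succeed immediately at the cut where it starts:

```python
s = "spam"

# "pa" sits immediately to the right of cut 1.
found_pa = s.find("pa", 1)

# The empty string is found at every cut, including the final cut
# at len(s), where no character follows.
empty_hits = [s.find("", start) for start in range(len(s) + 1)]
```

Here found_pa is 1 and empty_hits is [0, 1, 2, 3, 4].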

What about "spam".find("", 5)? Well, if you look at the string with the 
cuts marked as before:

0-1-2-3-4
|s|p|a|m|

you will see that there is no cut #5. Since there is no cut #5, we can't 
sensibly say we found *anything* there, not even the empty string. If you 
have four boxes, you can't say that you found anything in the fifth box.
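
A minimal sketch of the "no cut #5" case, again with made-up variable 
names. str.find reports failure with -1, while str.index performs the same 
search and raises ValueError instead:

```python
s = "spam"

# Starting past the last cut: nothing can be found, not even "".
past_last_cut = s.find("", 5)    # -1 means "not found"

# str.index signals the same failure with an exception.
try:
    s.index("", 5)
    index_raised = False
except ValueError:
    index_raised = True
```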

I realise that this behaviour clashes somewhat with the slicing rule that 
says that if the slice indexes go past the end of the string, you get an 
empty string. But that rule is more for convenience than a fundamental 
rule about strings.
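
The clash is easy to see side by side. In this sketch (names are 
illustrative), slicing from 5 quietly clamps to the end of the string, so 
searching the slice succeeds, whereas passing 5 as the start argument 
fails:

```python
s = "spam"

# Slicing past the end is forgiving and yields the empty string.
beyond = s[5:]

# Searching inside that (empty) slice finds "" at position 0 ...
via_slice = s[5:].find("")

# ... but the same start passed to find() reports "not found".
via_start = s.find("", 5)
```

So via_slice is 0 while via_start is -1, which is exactly the mismatch 
with "treated as in slice notation" being discussed.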

I think there is legitimate room for disagreement about the "right" 
behaviour, but backwards compatibility trumps logical correctness here, 
and the behaviour is very unlikely to be changed.

> The docstring does not describe this edge case, so I think it could be
> improved. If the first sentence (being an index in S) is kept, then it
> shouldn't say that start and end are treated as in slice notation,
> because that's actually not true.

+1

I think that you are right that the documentation needs to be improved.

-- 
Steven