Inconsistent behaviour of str.find/str.index when providing optional parameters

Thu Nov 22 13:22:57 EST 2012

On Thursday, 22 November 2012 09:44:21 UTC+1, Steven D'Aprano wrote:
> On Wed, 21 Nov 2012 23:01:47 -0800, Giacomo Alzetta wrote:
>
> > On Thursday, 22 November 2012 05:00:39 UTC+1, MRAB wrote:
> >> On 2012-11-22 03:41, Terry Reedy wrote:
> >>
> >> It can't return 5 because 5 isn't an index in 'spam'.
> >>
> >> It can't return 4 because 4 is below the start index.
> >
> > Uhm. Maybe you are right, because returning a greater value would cause
> > an IndexError, but then, *why* is 4 returned???
> >
> > >>> 'spam'.find('', 4)
> > 4
> > >>> 'spam'[4]
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> > IndexError: string index out of range
> >
> > 4 is not a valid index either. I do not think the behaviour was
> > completely intentional.
>
> The behaviour is certainly an edge case, but I think it is correct.
>
> (Correct or not, it has been the same going all the way back to Python
> 1.5, before strings even had methods, so it almost certainly will not be
> changed. Changing the behaviour now will very likely break hundreds,
> maybe thousands, of Python programs that expect the current behaviour.)
>

My point was not to change the behaviour but only to point out this possible inconsistency between what str.find/str.index do and what they claim to do in the documentation.

Anyway, I'm not so sure that changing the behaviour would break many programs... I mean, the change would only affect code that searches for an empty string beyond the string's bounds. I don't often see the start and end parameters of find/index used at all, and I don't think I have ever seen them used with out-of-bounds values. Add the requirement of searching for the empty string, and I think the number of programs that would break is minimal. And even if they did break, they would be really easy to fix.

Anyway, I understand what you mean and maybe it's better to keep this (at least to me) odd behaviour for backwards compatibility.

>
> By this logic, "spam".find("", 4) should return 4, because cut #4 is
> immediately to the left of the empty string. So Python's current
> behaviour is justified.
>
> What about "spam".find("", 5)? Well, if you look at the string with the
> cuts marked as before:
>
> 0-1-2-3-4
> |s|p|a|m|
>
> you will see that there is no cut #5. Since there is no cut #5, we can't
> sensibly say we found *anything* there, not even the empty string. If you
> have four boxes, you can't say that you found anything in the fifth box.
>
> I realise that this behaviour clashes somewhat with the slicing rule that
> says that if the slice indexes go past the end of the string, you get an
> empty string. But that rule is more for convenience than a fundamental
> rule about strings.

Yeah, I understand what you are saying, but the logic you describe is not mentioned anywhere, while slices are mentioned explicitly in the docstring.
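
To make the mismatch concrete, here is a minimal sketch comparing what slicing does with what str.find does for the same out-of-range start values. It only uses built-in string operations; the variable names are just illustrative.

```python
s = "spam"

# Slicing clamps out-of-range indexes, so both of these are empty strings.
slices = [s[4:], s[5:]]

# str.find accepts start == len(s) (cut #4 exists) but fails for
# start > len(s) (there is no cut #5), even though s[5:] is a valid slice.
found = [s.find("", start) for start in range(6)]

print(slices)  # ['', '']
print(found)   # [0, 1, 2, 3, 4, -1]
```

So the empty string is "in" the slice s[5:], yet find reports -1 for start=5, which is exactly the case the docstring does not cover.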

>
> > The docstring does not describe this edge case, so I think it could be
> > improved. If the first sentence (being an index in S) is kept, then it
> > shouldn't say that start and end are treated as in slice notation,
> > because that's actually not true.
>
> +1
>
> I think that you are right that the documentation needs to be improved.

Definitely. The sentence "Optional arguments start and end are interpreted as in slice notation." should be changed to something like:
"Optional arguments start and end are interpreted as in slice notation, unless start is (strictly?) greater than the length of S or end is smaller than start, in which cases the search always fails."

With this wording, 'spam'.find('', 4) *is* documented: start == len(S), so start and end are treated as in slice notation and returning 4 makes sense. 'spam'.find('', 5) returns -1 because 5 > len('spam'), so the search fails. And 'spam'.find('', 3, 2) returning -1 also makes sense because 2 < 3 (this edge case makes more sense, even though 'spam'[3:2] is still the empty string...).
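
As a rough check of the proposed wording, here is a pure-Python sketch of the rule. The function name find_as_documented is hypothetical, and the way negative indexes are normalised before the comparison is my assumption (done the way slices do it); this is not how CPython implements str.find, only a model that can be compared against it.

```python
def find_as_documented(s, sub, start=0, end=None):
    """Model of the proposed docstring rule (hypothetical helper)."""
    n = len(s)
    # Normalise end as a slice would: clamp to len(s), wrap negatives.
    if end is None or end > n:
        end = n
    elif end < 0:
        end = max(end + n, 0)
    # Wrap a negative start, but do NOT clamp a too-large start.
    if start < 0:
        start = max(start + n, 0)
    # Proposed exception to slice notation: the search always fails here.
    if start > n or end < start:
        return -1
    pos = s[start:end].find(sub)
    return -1 if pos == -1 else start + pos


# Compare the model with the real str.find over a grid of edge cases.
mismatches = [
    (sub, start, end)
    for sub in ("", "a", "am", "spam", "x")
    for start in range(-7, 8)
    for end in range(-7, 8)
    if find_as_documented("spam", sub, start, end) != "spam".find(sub, start, end)
]
print(mismatches)
```

If the list printed at the end is empty, the proposed sentence describes the actual behaviour for these inputs, including the three cases discussed above.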