Bug in string.find; was: Re: Proposed PEP: New style indexing,was Re: Bug in slice type

Tue Aug 30 04:53:27 EDT 2005

Terry Reedy wrote:
 > "Paul Rubin" wrote:
 >
 >>Really it's x[-1]'s behavior that should go, not find/rfind.
 >
 > I complete disagree, x[-1] as an abbreviation of x[len(x)-1] is
 > extremely useful, especially when 'x' is an expression instead of a
 > name.

Hear us out; your disagreement might not be so complete as you
think. From-the-far-end indexing is too useful a feature to
trash. If you look back several posts, you'll see that the
suggestion here is that the index expression should explicitly
call for it, rather than treat negative integers as a special
case.

I wrote up and sent off my proposal, and once the PEP-Editors
respond, I'll be pitching it on the python-dev list. Below is
the version I sent (not yet a listed PEP).

--
--Bryan

PEP: -1
Title: Improved from-the-end indexing and slicing
Version: $Revision: 1.00 $
Last-Modified: $Date: 2005/08/26 00:00:00 $
Author: Bryan G. Olson <bryan.olson at acm.org>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 26 Aug 2005
Post-History:

Abstract

     To index or slice a sequence from the far end, we propose
     using a symbol, '$', to stand for the length, instead of
     Python's current special-case interpretation of negative
     subscripts. Where Python currently uses:

         sequence[-i]

     We propose:

         sequence[$ - i]

     Python's treatment of negative indexes as offsets from the
     high end of a sequence causes minor obvious problems and
     major subtle ones. This PEP proposes a consistent meaning
     for indexes, yet still supports from-the-far-end
     indexing. Use of new syntax avoids breaking existing code.

Specification

     We propose a new style of slicing and indexing for Python
     sequences. Instead of:

         sequence[start : stop : step]

     new-style slicing uses the syntax:

         sequence[start ; stop ; step]

     It works like current slicing, except that negative start or
     stop values do not trigger from-the-high-end interpretation.
     Omissions and 'None' work the same as in old-style slicing.

     Within the square-brackets, the '$' symbol stands for the
     length of the sequence. One can index from the high end by
     subtracting the index from '$'. Instead of:

         seq[3 : -4]

     we write:

         seq[3 ; $ - 4]

     When square-brackets appear within other square-brackets,
     the inner-most bracket-pair determines which sequence '$'
     describes. The length of the next-outer sequence is denoted
     by '$1', and the next-outer after that by '$2', and so on. The
     symbol '$0' behaves identically to '$'. Resolution of $x is
     syntactic; a callable object invoked within square brackets
     cannot use the symbol to examine the context of the call.

     The '$' notation also works in simple (non-slice) indexing.
     Instead of:

         seq[-2]

     we write:

         seq[$ - 2]

     If we did not care about backward compatibility, new-style
     slicing would define seq[-2] to be out-of-bounds. Of course
     we do care about backward compatibility, and rejecting
     negative indexes would break way too much code. For now,
     simple indexing with a negative subscript (and no '$') must
     continue to index from the high end, as a deprecated
     feature. The presence of '$' always indicates new-style
     indexing, so a programmer who needs a negative index to
     trigger a range error can write:

         seq[($ - $) + index]
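
     The '$' syntax does not exist, so the intended behavior of
     simple indexing can only be approximated. The following is
     a rough sketch, written for a present-day Python 3
     interpreter; the names END, _End and Strict are hypothetical
     stand-ins invented for illustration and are not part of the
     proposal. END plays the role of '$', and Strict rejects
     negative subscripts instead of reinterpreting them. Slices
     and the nested '$1', '$2' forms are not covered.

```python
class _End:
    """Hypothetical stand-in for the proposed '$' symbol."""

    def __init__(self, offset=0):
        self.offset = offset

    def __sub__(self, n):
        # END - n is still "relative to the length", shifted down by n.
        return _End(self.offset - n)

    def __add__(self, n):
        return _End(self.offset + n)

    def resolve(self, length):
        # Turn the symbolic position into a plain integer index.
        return length + self.offset


END = _End()


class Strict:
    """Hypothetical wrapper: negative subscripts are out of bounds."""

    def __init__(self, seq):
        self.seq = seq

    def __getitem__(self, index):
        if isinstance(index, _End):
            index = index.resolve(len(self.seq))
        if index < 0:
            raise IndexError("negative index %d is out of range" % index)
        return self.seq[index]


s = Strict("buggy")
last = s[END - 1]            # plays the role of seq[$ - 1]

try:
    s["buggy".find("w")]     # find() returns -1, which is now an error
    caught = False
except IndexError:
    caught = True
```

     Under these assumptions, far-end indexing stays available
     through END, while a stray -1 surfaces as an IndexError.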

Motivation

     From-the-far-end indexing is such a useful feature that we
     cannot reasonably propose its removal; nevertheless Python's
     current method, which is to treat a range of negative
     indexes as special cases, is warty. The wart bites novice or
     imperfect Pythoners by not raising an exception when they
     need to know about a bug. For example, the following code
     prints 'y' with no sign of error:

         s = 'buggy'
         print s[s.find('w')]
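
     To make the failure mode concrete, here is a small sketch
     (in Python 3 syntax, with illustrative variable names) that
     separates the two steps. find() reports "not found" as -1,
     and the subscript then quietly treats that -1 as the last
     position. The index() method, by contrast, raises
     ValueError when the substring is absent.

```python
s = 'buggy'

position = s.find('w')     # 'w' does not occur, so find() returns -1
silent = s[position]       # s[-1]: the last character, no exception

try:
    s.index('w')           # index() signals the miss with an exception
    raised = False
except ValueError:
    raised = True
```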

     The wart becomes an even bigger problem with more
     sophisticated use of Python sequences. What is the 'stop'
     value for a slice when the step is negative and the slice
     includes the zero index? An instance of Python's slice type
     will report that the stop value is -1, but if we use this
     stop value to slice, it gets misinterpreted as the last
     index in the sequence. Here's an example:

         class BuggerAll:

             def __init__(self, somelist):
                 self.sequence = somelist[:]

             def __getitem__(self, key):
                 if isinstance(key, slice):
                     start, stop, step = key.indices(len(self.sequence))
                     # print 'Slice says start, stop, step are:', start, stop, step
                     return self.sequence[start : stop : step]

         print           range(10) [None : None : -2]
         print BuggerAll(range(10))[None : None : -2]

     The above prints:

         [9, 7, 5, 3, 1]
         []

     Un-commenting the print statement in __getitem__ shows:

         Slice says start, stop, step are: 9 -1 -2

     The slice object seems to think that -1 is a valid exclusive
     bound, but when using it to actually slice, Python
     interprets the negative number as an offset from the high
     end of the sequence.

     Steven Bethard offered the simpler example:

         py> range(10)[slice(None, None, -2)]
         [9, 7, 5, 3, 1]
         py> slice(None, None, -2).indices(10)
         (9, -1, -2)
         py> range(10)[9:-1:-2]
         []
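
     For comparison, the same triple behaves as expected when it
     is used for index arithmetic rather than as a subscript. The
     sketch below (Python 3 syntax; the class name Careful is
     made up for illustration) feeds the result of indices() to
     range(), where -1 is an ordinary exclusive bound.

```python
class Careful:
    """Illustrative variant of BuggerAll that avoids re-slicing."""

    def __init__(self, somelist):
        self.sequence = somelist[:]

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self.sequence))
            # range() treats a negative stop as a plain bound,
            # not as an offset from the high end.
            return [self.sequence[i] for i in range(start, stop, step)]
        return self.sequence[key]


result = Careful(list(range(10)))[None:None:-2]
```

     Here result holds the same elements as the built-in list
     slice, which suggests the triple itself is sound and only
     its reuse as a subscript goes wrong.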

     The double-meaning of -1, as both an exclusive stopping
     bound and an alias for the highest valid index, is just
     plain whacked. So what should the slice object return? With
     Python's current indexing/slicing, there is no value that
     just works. 'None' will work as a stop value in a slice, but
     index arithmetic will fail.  The value 0 - (len(sequence) +
     1) will work as a stop value, and slice arithmetic and
     range() will happily use it, but the result is not what the
     programmer probably intended.
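
     The two candidate stop values can be tried directly. This
     is a sketch in Python 3 syntax with illustrative variable
     names; it shows each candidate working as a subscript and
     then failing in the other role.

```python
seq = list(range(10))
n = len(seq)

# Both candidates work as the stop of a subscript slice.
with_none = seq[9:None:-2]
with_low = seq[9:-(n + 1):-2]

# None cannot take part in index arithmetic.
try:
    None - 1
    arithmetic_ok = True
except TypeError:
    arithmetic_ok = False

# range() accepts -(n + 1), but runs far past index zero.
via_range = list(range(9, -(n + 1), -2))
```

     The last line yields ten values, five of them negative,
     where the slice itself selects only five elements.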

     The problem is subtle. A Python sequence starts at index
     zero. There is some appeal to giving negative indexes a
     useful interpretation, on the theory that they were invalid
     as subscripts and thus useless otherwise. That theory is
     wrong, because negative indexes were already useful, even
     though not legal subscripts, and the reinterpretation often
     breaks their existing use. Specifically, negative indexes are
     useful in index arithmetic, and as exclusive stopping
     bounds.

     The problem is fixable. We propose that negative indexes not
     be treated as a special case. To index from the far end of a
     sequence, we use a syntax that explicitly calls for far-end
     indexing.

Rationale

     New-style slicing/indexing is designed to fix the problems
     described above, yet live happily in Python along-side the
     old style. The new syntax leaves the meaning of existing
     code unchanged, and is even more Pythonic than current
     Python.

     Semicolons look a lot like colons, so the new semicolon
     syntax follows the rule that things that are similar should
     look similar. The semicolon syntax is currently illegal, so
     its addition will not break existing code. Python is
     historically tied to C, and the semicolon syntax is
     evocative of the similar start-stop-step expressions of C's
     'for' loop.  Jython is tied to Java, which uses a similar
     'for' loop syntax.

     The '$' character currently has no place in a Python index,
     so its new interpretation will not break existing code. We
     chose it over other unused symbols because the usage roughly
     corresponds to its meaning in the Python library's regular
     expression module.

     We expect use of the $0, $1, $2 ... syntax to be rare;
     nevertheless, it has a Pythonic consistency. Thanks to Paul
     Rubin for advocating it over the inferior multiple-$ syntax
     that this author initially proposed.

Backwards Compatibility

     To avoid breaking code, we use new syntax that is currently
     illegal. The new syntax more-or-less looks like current
     Python, which may help Python programmers adjust.

     User-defined classes that implement the sequence protocol
     are likely to work, unchanged, with new-style slicing.
     'Likely' is not certain; we've found one subtle issue (and
     there may be others):

     Currently, user-defined classes can implement Python
     subscripting and slicing without implementing Python's len()
     function. In our proposal, the '$' symbol stands for the
     sequence's length, so classes must be able to report their
     length in order for $ to work within their slices and
     indexes.

     Specifically, to support new-style slicing, a class that
     accepts index or slice arguments to any of:

         __getitem__
         __setitem__
         __delitem__
         __getslice__
         __setslice__
         __delslice__

     must also consistently implement:

         __len__

     Sane programmers already follow this rule.
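
     As a rough illustration of the issue, consider a class that
     supports subscripting but not len(). The sketch below is in
     Python 3 syntax; NoLen, WithLen and from_end are
     hypothetical names, and from_end only approximates what
     seq[$ - i] would need to compute.

```python
class NoLen:
    """Subscriptable, but cannot report a length."""

    def __getitem__(self, index):
        return index * 2


class WithLen(NoLen):
    """Same subscripting, plus a consistent __len__."""

    def __len__(self):
        return 5


def from_end(seq, i):
    # Approximates seq[$ - i]: the length must be known first.
    return seq[len(seq) - i]


ok = from_end(WithLen(), 1)      # index 4

try:
    from_end(NoLen(), 1)         # len() has nothing to call
    failed = False
except TypeError:
    failed = True
```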

Copyright:

     This document has been placed in the public domain.