[Python-ideas] discontinue iterable strings

Sun Aug 21 01:22:55 EDT 2016

On 21 August 2016 at 14:10, Chris Angelico <rosuav at gmail.com> wrote:
> On Sun, Aug 21, 2016 at 12:52 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>> I think that while the suggestion does bring some benefit, the benefit
>> isn't enough to make up for the code churn and disruption it would
>> cause. But I encourage the OP to go through the standard library, pick a
>> couple of modules, and re-write them to see how they would look using
>> this proposal.
>
> Python still has a rule that you can iterate over anything that has
> __getitem__, and it'll be called with 0, 1, 2, 3... until it raises
> IndexError. So you have two options: Remove that rule, and require
> that all iterable objects actually define __iter__; or make strings
> non-subscriptable, which means you need to do something like
> "asdf".char_at(0) instead of "asdf"[0]. IMO the second option is a
> total non-flyer - good luck convincing anyone that THAT is an
> improvement. The first one is possible, but dramatically broadens the
> backward-compatibility issue. You'd have to search for any class that
> defines __getitem__ and not __iter__.

That's not actually true - any type that defines __getitem__ can
prevent iteration just by explicitly raising TypeError from __iter__.
It would be *weird* to do so, but it's entirely possible.
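
As a minimal sketch of both behaviours (the class names LegacySequence
and IndexOnly are made up for illustration), the first class below is
iterable purely through the __getitem__ fallback, while the second
stays subscriptable but opts out of iteration by raising TypeError
from __iter__:

```python
class LegacySequence:
    """Iterable only via the old __getitem__ fallback (defines no __iter__)."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        # iter() calls this with 0, 1, 2, ... until IndexError is raised.
        return self._items[index]


class IndexOnly(LegacySequence):
    """Still subscriptable, but explicitly refuses iteration."""

    def __iter__(self):
        raise TypeError(
            "{!r} object is not iterable".format(type(self).__name__)
        )


legacy = LegacySequence("abc")
index_only = IndexOnly("abc")

print(list(legacy))     # ['a', 'b', 'c'] via the __getitem__ fallback
print(index_only[0])    # indexing still works: a

try:
    iter(index_only)
except TypeError as exc:
    print("iteration refused:", exc)
```

Because iter() consults __iter__ before it tries the __getitem__
fallback, the TypeError wins and for loops, list(), and friends all
fail on IndexOnly instances.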

However, the real problem with this proposal (and the reason why the
switch from 8-bit str to "bytes are effectively a tuple of ints" in
Python 3 was such a pain) is that there are a lot of bytes and text
processing operations that *really do* operate code point by code
point.

Scanning a path for directory separators, scanning a CSV (or other
delimited format) for delimiters, processing regular expressions,
tokenising according to a grammar, analysing words in a text for
character popularity, answering questions like "Is this a valid
identifier?" all involve looking at each character in a sequence
individually, rather than looking at the character sequence as an
atomic unit.

The idiomatic pattern for doing that kind of "item by item" processing
in Python is iteration (whether through the Python syntax and
builtins, or through the CPython C API).
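
To make that concrete, here is a rough sketch of three of the tasks
listed above, each written as plain iteration over a str. The helper
names split_on and looks_like_ascii_identifier are hypothetical, and
the identifier check is deliberately simplified to ASCII (the real
rules are what str.isidentifier() implements):

```python
import string
from collections import Counter

_IDENT_CHARS = set(string.ascii_letters + string.digits + "_")


def split_on(text, delimiter=","):
    """Split on a single-character delimiter by scanning each character."""
    fields, current = [], []
    for ch in text:                  # relies on str being iterable
        if ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def looks_like_ascii_identifier(text):
    """Simplified check: ASCII letters, digits, underscore; no leading digit."""
    if not text or text[0] in string.digits:
        return False
    return all(ch in _IDENT_CHARS for ch in text)


# Character popularity: Counter simply iterates over its argument.
popularity = Counter("banana")

print(split_on("a,b,,c"))
print(looks_like_ascii_identifier("spam_1"))
print(popularity.most_common(1))
```

None of these would work unchanged if str stopped being iterable; each
would need an explicit "give me the code points" step first.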

Now, if we were designing a language from scratch today, there's a
strong case to be made that the *right* way to represent text is to
have a stream-like interface (e.g. StringIO, BytesIO) around an atomic
type (e.g. CodePoint, int). But we're not designing a language from
scratch - we're iterating on one with a 25-year history of design,
development, and use.

There may also be a case to be made for introducing an AtomicStr type
into Python's data model that works like a normal string, but
*doesn't* support indexing, slicing, or iteration, and is instead an
opaque blob of data that nevertheless supports all the other usual
string operations. (Similar to the way that types.MappingProxyType
lets you provide a read-only view of an otherwise mutable mapping, and
that collections.abc.KeysView, ValuesView and ItemsView provide
different interfaces for a common underlying mapping.)
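
Purely as an illustration of what such a type might look like (nothing
like this exists in the standard library, and every name and design
choice below is an assumption), a wrapper can delegate ordinary string
methods to an underlying str while leaving out __getitem__ and
__iter__:

```python
from collections.abc import Iterable


class AtomicStr:
    """Hypothetical opaque text wrapper: no indexing, slicing, or iteration."""

    def __init__(self, value):
        self._value = str(value)

    # __getitem__ and __iter__ are deliberately not defined, so obj[0],
    # obj[1:], iter(obj) and "for ch in obj" all raise TypeError.

    def __getattr__(self, name):
        # Only called when normal lookup fails. Refuse private and dunder
        # names so the sequence protocol cannot leak through delegation.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._value, name)

    def __str__(self):
        return self._value

    def __repr__(self):
        return "AtomicStr({!r})".format(self._value)

    def __len__(self):
        return len(self._value)

    def __contains__(self, substring):
        return substring in self._value

    def __eq__(self, other):
        if isinstance(other, AtomicStr):
            return self._value == other._value
        if isinstance(other, str):
            return self._value == other
        return NotImplemented

    def __hash__(self):
        return hash(self._value)


name = AtomicStr("asdf")
print(name.upper(), len(name), "sd" in name)
print(isinstance(name, Iterable))
```

One limitation of this sketch is that delegated methods such as upper()
return plain str objects rather than AtomicStr, so a real design would
need to decide where the opaque wrapper is preserved.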

But changing the core text type itself to no longer be suitable for
use in text processing tasks? Not gonna happen :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia