[Python-ideas] strings as iterables - from str.startswith taking any iterator instead of just tuple

Fri Jan 3 15:21:31 CET 2014

On 01/03/2014 12:41 PM, Nick Coghlan wrote:
> The underlying problem is that strings have a dual nature: you can
> view them as either a sequence of code points (which is how Python
> models them), or else you can view them as an opaque chunk of text
> (which is often how you want to treat them in code that accepts either
> containers or atomic values and treats them differently).
>
> This has some interesting implications for API design.
>
> "def f(*args)" handles the constraint fairly well, as f("astring") is
> treated as a single value and f(*"string") is an unlikely mistake for
> anyone to make.
>
> "def f(iterable)" has problems in many cases, since f("string") is
> treated as an iterable of code points, even if you'd prefer an
> immediate error.
>
> "def f(iterable_or_atomic)" also has problems, since strings will use
> the "iterable" path, even if the atomic handling would be more
> appropriate.
>
> Algorithms that recursively descend into containers also need to deal
> with the fact that doing so with strings causes an infinite loop
> (since iterating over a string produces length 1 strings).
>
> This is a genuine problem, which is why the question of how to cleanly
> deal with these situations keeps coming up every couple of years, and
> the current state of the art answer is "grit your teeth and use
> isinstance(obj, str)" (or a configurable alternative).
>
> However, I'm wondering if it might be reasonable to add a new entry in
> collections.abc for 3.5:
>
> >>> from abc import ABC
> >>> from collections.abc import Iterable
> >>> class Atomic(ABC):
> ...     @classmethod
> ...     def __subclasshook__(cls, subclass):
> ...         if not issubclass(subclass, Iterable):
> ...             return True
> ...         return NotImplemented
> ...
> >>> Atomic.register(str)
> <class 'str'>
> >>> Atomic.register(bytes)
> <class 'bytes'>
> >>> Atomic.register(bytearray)
> <class 'bytearray'>
> >>> isinstance(1, Atomic)
> True
> >>> isinstance(1.0, Atomic)
> True
> >>> isinstance(1j, Atomic)
> True
> >>> isinstance("Hello", Atomic)
> True
> >>> isinstance(b"Hello", Atomic)
> True
> >>> isinstance((), Atomic)
> False
> >>> isinstance([], Atomic)
> False
> >>> isinstance({}, Atomic)
> False
>
> Any type which wasn't iterable would automatically be considered
> atomic, while some types which *are* iterable could *also* be
> registered as atomic (with str, bytes and bytearray being the obvious
> candidates, as shown above).
>
> Armed with such an ABC, you could then write an "iter_non_atomic"
> helper function as:
>
>      def iter_non_atomic(iterable):
>          if isinstance(iterable, Atomic):
>              raise TypeError("{!r} is considered atomic".format(
>                  iterable.__class__.__name__))
>          return iter(iterable)
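
For the recursive-descent case mentioned in the quoted text, here is a minimal
sketch of how such an ABC might be used. Atomic is the proposed class from the
quote, not something that exists in collections.abc, and flatten is a
hypothetical helper introduced only for illustration.

```python
from abc import ABC
from collections.abc import Iterable


class Atomic(ABC):
    """Proposed ABC from the quoted mail (not part of the stdlib)."""

    @classmethod
    def __subclasshook__(cls, subclass):
        # Anything that is not iterable is atomic automatically.
        if not issubclass(subclass, Iterable):
            return True
        # Iterable types are atomic only if explicitly registered.
        return NotImplemented


for _text_type in (str, bytes, bytearray):
    Atomic.register(_text_type)


def flatten(obj):
    """Hypothetical helper: yield the atomic leaves of a nested container."""
    if isinstance(obj, Atomic):
        # Strings stop here instead of recursing into length-1 strings.
        yield obj
        return
    for item in obj:
        yield from flatten(item)


leaves = list(flatten([1, "ab", [2.0, (b"c", [3])]]))
```

Without the Atomic check, flatten("ab") would recurse forever, because each
character of a string is itself a string of length 1.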

I like this solution, but I could also live with checking for the type
(usually str). The point is that, although the issue is not that uncommon,
when it arises one has to deal with it in one or at most a few places in the
code (typically at the start of one or a few methods of a given type). It is
not as if we had to carry unneeded overhead around everywhere.
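
As a rough sketch of what that plain type check looks like at the start of a
function (as_items and starts_with_any are hypothetical names, and the choice
to treat only str as atomic is an assumption):

```python
def as_items(value):
    """Hypothetical normalizer: wrap a lone string, pass other iterables on."""
    if isinstance(value, str):
        # Treat the string as one opaque value, not as code points.
        return [value]
    return list(value)


def starts_with_any(text, prefixes):
    """Accept a single prefix or any iterable of prefixes."""
    # str.startswith itself only accepts a str or a tuple of str.
    return text.startswith(tuple(as_items(prefixes)))


single = starts_with_any("foobar", "foo")
several = starts_with_any("foobar", iter(["x", "fo"]))
```

The check lives in one place, and callers may pass a string, a list, or a
generator.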

Denis