[Python-ideas] strings as iterables - from str.startswith taking any iterator instead of just tuple

Fri Jan 3 12:41:09 CET 2014

On 3 January 2014 20:19, spir <denis.spir at gmail.com> wrote:
> On 01/03/2014 04:54 AM, Alexander Heger wrote:
>>>
>>> By designing an API that doesn't require such overloading.
>>>
>>> On Thursday, January 2, 2014, Alexander Heger wrote:
>>>>
>>>>
>>>>>>     isinstance(x, Iterable) and not isinstance(x, str)
>>>>>
>>>>>
>>>>> If you find yourself typing that a lot I think you have a bigger
>>>>> problem
>>>>> though.
>>>>
>>>>
>>>> How do you replace this?
>>
>>
>> for my applications this seemed the most natural way - have the method
>> deal with what it is fed, which could be strings or any kind of
>> collections or iterables of strings.  But never would I want to
>> disassemble strings into characters.  From the previous message I
>> gather that I am not the only one with this application case.
>>
>> Generally, I find strings being iterables of characters as useful as
>> if integers were iterables of bits.  They should just be units.  They
>> already start out being not mutable.  I think it would be a positive
>> design change for Python 4 to make them units instead of being
>> iterables.  At least for me, there are far fewer applications where
>> the latter is useful than where it requires extra code.  Overall, it
>> makes the language less clean that a string is an iterable; a special
>> case we always have to code around.
>>
>> I know it will break a lot of existing code, but so did the string
>> change from py2 to 3.  (It would break very little of my code, though.)
>
>
> I agree there is an occasional need, which I have also met in real code: it
> was parse result data, which can be a string (terminal patterns, which really
> "eat" part of the source) or a list (or another "true" iterable collection,
> for composite or repetitive patterns). But the case is rare because it
> requires a coincidence of conditions:
> * both strings and collections may come as input
> * both are valid from the point of view of the app's logic
> * one wants to iterate collections, but not strings
>
> On the other hand, I find you much too quickly dismiss the real and very
> common need to iterate strings (over their lowest units, code points),
> apparently on the sole basis that in your own programming practice you
> don't need or want it.
>
> We should not make iterating strings a special case (e.g. by requiring an
> explicit call to an iterator, like "for ucode in s.ucodes()"), because the
> case is so common. Instead we may consider finding a way to exclude strings
> in some collection traversal idiom (for which I have no good proposal: the
> obvious one would be .items(), but it's used for a different meaning), which
> would for instance raise an exception on strings because they don't match
> the idiom ("str object has no 'items' attribute").

The underlying problem is that strings have a dual nature: you can
view them as either a sequence of code points (which is how Python
models them), or else you can view them as an opaque chunk of text
(which is often how you want to treat them in code that accepts either
containers or atomic values and treats them differently).

This has some interesting implications for API design.

"def f(*args)" handles the constraint fairly well, as f("astring") is
treated as a single value and f(*"string") is an unlikely mistake for
anyone to make.

"def f(iterable)" has problems in many cases, since f("string") is
treated as an iterable of code points, even if you'd prefer an
immediate error.
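
A minimal sketch of the difference between these two signatures (the
function names are hypothetical and only there for illustration):

```python
def lengths_varargs(*args):
    # Each positional argument is one value, so a lone string stays whole.
    return [len(arg) for arg in args]


def lengths_iterable(iterable):
    # Any iterable is accepted, so a lone string is split into code points.
    return [len(item) for item in iterable]


whole = lengths_varargs("string")        # one item of length 6
split = lengths_iterable("string")       # six items of length 1
intended = lengths_iterable(["string"])  # the caller has to wrap the string
```

The second call succeeds silently, which is the problem: nothing tells the
caller that a list of strings was expected.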

"def f(iterable_or_atomic)" also has problems, since strings will use
the "iterable" path, even if the atomic handling would be more
appropriate.

Algorithms that recursively descend into containers also need to deal
with the fact that doing so with strings causes infinite recursion
(since iterating over a string produces length 1 strings, which are
themselves iterable).

This is a genuine problem, which is why the question of how to cleanly
deal with these situations keeps coming up every couple of years, and
the current state of the art answer is "grit your teeth and use
isinstance(obj, str)" (or a configurable alternative).
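
As a rough sketch of that idiom applied to recursive descent (the
"flatten" helper and its "atomic_types" parameter are hypothetical names
chosen for this example, with the parameter playing the role of the
configurable alternative):

```python
from collections.abc import Iterable


def flatten(obj, atomic_types=(str, bytes, bytearray)):
    """Yield the leaves of arbitrarily nested iterables.

    atomic_types is a hypothetical parameter: instances of these types
    are yielded whole instead of being descended into.
    """
    if isinstance(obj, atomic_types) or not isinstance(obj, Iterable):
        yield obj
        return
    for item in obj:
        yield from flatten(item, atomic_types)


leaves = list(flatten(["ab", ["cd", ("ef", 1)], 2.0]))
```

Passing atomic_types=() removes the guard, and flattening any non-empty
string then recurses until the interpreter's recursion limit is hit.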

However, I'm wondering if it might be reasonable to add a new entry in
collections.abc for 3.5:

>>> from abc import ABC
>>> from collections.abc import Iterable
>>> class Atomic(ABC):
...     @classmethod
...     def __subclasshook__(cls, subclass):
...         if not issubclass(subclass, Iterable):
...             return True
...         return NotImplemented
...
>>> Atomic.register(str)
<class 'str'>
>>> Atomic.register(bytes)
<class 'bytes'>
>>> Atomic.register(bytearray)
<class 'bytearray'>
>>> isinstance(1, Atomic)
True
>>> isinstance(1.0, Atomic)
True
>>> isinstance(1j, Atomic)
True
>>> isinstance("Hello", Atomic)
True
>>> isinstance(b"Hello", Atomic)
True
>>> isinstance((), Atomic)
False
>>> isinstance([], Atomic)
False
>>> isinstance({}, Atomic)
False

Any type which wasn't iterable would automatically be considered
atomic, while some types which *are* iterable could *also* be
registered as atomic (with str, bytes and bytearray being the obvious
candidates, as shown above).

Armed with such an ABC, you could then write an "iter_non_atomic"
helper function as:

    def iter_non_atomic(iterable):
        # Refuse anything registered as (or inferred to be) atomic.
        if isinstance(iterable, Atomic):
            raise TypeError("{!r} is considered atomic".format(
                iterable.__class__.__name__))
        return iter(iterable)
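
Putting the pieces together as a self-contained sketch: Atomic here is
only the ABC proposed in this message, not something that exists in
collections.abc, and "describe" is a hypothetical caller added purely to
show the behaviour:

```python
from abc import ABC
from collections.abc import Iterable


class Atomic(ABC):
    """Sketch of the proposed ABC; not part of collections.abc."""

    @classmethod
    def __subclasshook__(cls, subclass):
        # Anything that is not iterable counts as atomic automatically.
        if not issubclass(subclass, Iterable):
            return True
        return NotImplemented


# Iterable types that should nevertheless be treated as atomic.
for string_like in (str, bytes, bytearray):
    Atomic.register(string_like)


def iter_non_atomic(iterable):
    if isinstance(iterable, Atomic):
        raise TypeError("{!r} is considered atomic".format(
            iterable.__class__.__name__))
    return iter(iterable)


def describe(obj):
    # Hypothetical caller: return the items, or the error message.
    try:
        return list(iter_non_atomic(obj))
    except TypeError as exc:
        return str(exc)


from_list = describe(["a", "b"])
from_str = describe("ab")
from_int = describe(1)
```

With this in place a string passed where a container was expected fails
immediately, instead of being silently iterated code point by code point.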

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia