Why chunks is not part of the python standard lib?

Oscar Benjamin oscar.j.benjamin at gmail.com
Thu May 2 08:52:16 EDT 2013


On 2 May 2013 06:15, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Wed, 01 May 2013 10:00:04 +0100, Oscar Benjamin wrote:
>>
>> I'm not sure that's a valid argument against it since a chunks function
>> could just do a different thing depending on the arguments given.
>
> Yes, but that's a rubbish API. I'd rather have five separate functions.
> Or maybe five methods on a single Chunker object.

Fair enough.

[snip]
> - and none of them are actually very hard to write.

They are all easy to write as generator functions, but to me the point
of itertools is that you can do things more efficiently than a
generator function. Otherwise, code that uses a combination of
itertools primitives is usually harder to understand than an
equivalent generator function, so I'd probably avoid itertools.

> So the prospect of adding chunks somewhere is unattractive: lots of angst
> for very little benefit. Yes, it could be done, but none of the Python
> developers, and especially not the itertools "owner", Raymond Hettinger,
> think that the added complication is worth the benefit.

It's not necessarily that chunks should be added, but that it would be
good if itertools had the necessary primitives to achieve something
like chunks without a generator function. As Wolfgang says, zip() and
zip_longest() are only 2 of the 4 obvious types of zipping, and they
correspond to the only 2 of the 4 obvious types of chunks() that can
be done with itertools primitives.
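To make the correspondence concrete, here are the four behaviours
sketched in Python 3 spelling. zip_strict() and zip_relaxed() are
illustrative names for the two missing variants (zip_relaxed() is the
one mentioned further down); the sketches are built on zip_longest()
rather than being efficient primitives:

```python
from itertools import zip_longest

a, b = [1, 2, 3], ['x', 'y']

# 1) Truncate at the shortest input: the builtin zip().
print(list(zip(a, b)))                          # [(1, 'x'), (2, 'y')]

# 2) Pad the shorter inputs: itertools.zip_longest().
print(list(zip_longest(a, b, fillvalue=None)))  # [(1, 'x'), (2, 'y'), (3, None)]

# 3) Raise if the lengths differ -- no itertools primitive for this,
#    so a generator-function sketch:
def zip_strict(*iterables):
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        if sentinel in combo:
            raise ValueError("iterables have different lengths")
        yield combo

# 4) Yield shorter tuples once an input runs out -- again no
#    primitive; this is the hypothetical zip_relaxed():
def zip_relaxed(*iterables):
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        yield tuple(x for x in combo if x is not sentinel)

print(list(zip_relaxed(a, b)))                  # [(1, 'x'), (2, 'y'), (3,)]
```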

>> The issue is around how to deal with the last chunk if it isn't the same
>> length as the others and I can only think of 4 reasonable responses:
>
> That's not the only issue. What does chunking (grouping) even mean? Given:
>
> chunk("abcdef", 3)
>
> should I get this?  [abc, def]
>
> or this?  [abc, bcd, cde, def]

I don't think many people expect that. In any case, that type of
chunking (a sliding window) is already possible using itertools.
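For example, a sliding window can be built from tee(), islice() and
zip() alone (a sketch; the name sliding() is mine, not anything in
the stdlib):

```python
from itertools import islice, tee

def sliding(iterable, n):
    """Overlapping length-n chunks built purely from itertools primitives."""
    iterators = tee(iterable, n)
    for k, it in enumerate(iterators):
        # Advance the k-th copy k steps (the consume() recipe from the
        # itertools docs), so the copies are staggered by one item each.
        next(islice(it, k, k), None)
    # zip() stops at the first exhausted copy, i.e. after the last
    # complete window.
    return zip(*iterators)

print(list(sliding("abcdef", 3)))
# [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e'), ('d', 'e', 'f')]
```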

> There are good use-cases for both.
>
> If given a string, should chunking re-join the individual characters into
> strings, or leave them as lists of chars? Tuples of chars? E.g.
>
> chunk("abcdef", 3) => "abc" ...
>
> or ["a", "b", "c"] ...
>
> How about bytes?

It should be tuples or lists. Anything in itertools should just treat
everything as iterators/iterables and not alter its behaviour for
different types of sequences.

>> 1) Yield a shorter chunk
>> 2) Extend the chunk with fill values
>> 3) Raise an error
>> 4) Ignore the last chunk
>>
>> Cases 2 and 4 can be achieved with current itertools primitives e.g.: 2)
>> izip_longest(fillvalue=fillvalue, *[iter(iterable)] * n) 4)
>> zip(*[iter(iterable)] * n)
>>
>> However I have only ever had use cases for 1 and 3 and these are not
>> currently possible without something additional (e.g. a generator
>> function).
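In Python 3 spelling (izip_longest is zip_longest, izip is the
builtin zip), those two idioms and their behaviour:

```python
from itertools import zip_longest

data = "abcdefgh"
n = 3

# Case 2: pad the last chunk. The n references to the *same* iterator
# mean each output tuple pulls n consecutive items.
it = iter(data)
padded = list(zip_longest(*[it] * n, fillvalue='-'))
print(padded)   # [('a','b','c'), ('d','e','f'), ('g','h','-')]

# Case 4: drop the short last chunk. zip() stops at the first
# exhausted iterator, discarding the partial tuple.
it = iter(data)
dropped = list(zip(*[it] * n))
print(dropped)  # [('a','b','c'), ('d','e','f')]
```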
>
> All of these are trivial. Start with the grouper recipe from the itertools
> documentation, which is your case 2) above, renaming if desired:
>
> http://docs.python.org/2/library/itertools.html#recipes
>
[snip]
>
> def chunk_short(n, iterable):  # Case 1) above
>     sentinel = object()
>     for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
>         if sentinel not in chunk:

The point is that it would be good to avoid the check in the line
above, and the overhead of an if statement and a generator frame
between every next() call. You can do better by writing the line
above as
    if chunk[-1] is not sentinel:
since fill values can only ever appear at the end of the final chunk,
which turns an O(n) membership test into an O(1) identity test. But
it is still a redundant check: zip_relaxed() would know when
StopIteration was raised and wouldn't need to retrospectively check
every chunk.

>             yield chunk
>         else:
>             i = chunk.index(sentinel)
>             yield chunk[:i]
>
>
> def chunk_strict(n, iterable):  # Case 3) above
>     sentinel = object()
>     for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
>         if sentinel in chunk:

if chunk[-1] is sentinel:

>             raise ValueError
>         yield chunk
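Put together with the cheaper last-element test, the two recipes look
like this in Python 3 spelling (a sketch; chunk_pad() is the
grouper() recipe from the itertools docs):

```python
from itertools import zip_longest

def chunk_pad(n, iterable, fillvalue=None):
    """Case 2: pad the last chunk (the grouper() recipe)."""
    return zip_longest(*[iter(iterable)] * n, fillvalue=fillvalue)

def chunk_short(n, iterable):
    """Case 1: yield a shorter final chunk."""
    sentinel = object()
    for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
        if chunk[-1] is not sentinel:   # fill can only be at the end
            yield chunk
        else:
            yield chunk[:chunk.index(sentinel)]

def chunk_strict(n, iterable):
    """Case 3: raise if the length is not a multiple of n."""
    sentinel = object()
    for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
        if chunk[-1] is sentinel:
            raise ValueError("iterable length not divisible by n")
        yield chunk

print(list(chunk_short(3, "abcdefgh")))
# [('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'h')]
```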
>
>
> def chunk(n, iterable):  # Case 4) above
>     args = [iter(iterable)]*n
>     return izip(*args)
>
>
> def chunk_other(n, iterable):  # I suck at thinking up names...
>     it = iter(iterable)
>     values = [next(it) for _ in range(n)]  # What if this is short?
>     while True:
>         yield tuple(values)
>         values.pop(0)
>         try:
>             values.append(next(it))
>         except StopIteration:
>             break

You can do this one somewhat wastefully with itertools instead of a
generator function:

def chunk_other(n, iterable):
    # n staggered copies of the iterable: advance the k-th copy by k
    # items, then zip the copies to get length-n sliding windows.
    iterators = tee(iterable, n)
    for k, it in enumerate(iterators):
        for _ in range(k):
            try:
                next(it)
            except StopIteration:
                # Fewer than n items in total: no complete window.
                return iter([])
    return izip(*iterators)

I don't know which of those two would be faster, but the generator
function is certainly easier to understand.


Oscar


