Why chunks is not part of the python standard lib?

Thu May 2 04:53:50 EDT 2013

Steven D'Aprano <steve+comp.lang.python <at> pearwood.info> writes:

>
> > 1) Yield a shorter chunk
> > 2) Extend the chunk with fill values
> > 3) Raise an error
> > 4) Ignore the last chunk
> > 
> > Cases 2 and 4 can be achieved with current itertools primitives, e.g.:
> > 2) izip_longest(fillvalue=fillvalue, *[iter(iterable)] * n)
> > 4) zip(*[iter(iterable)] * n)
> > 
> > However I have only ever had use cases for 1 and 3 and these are not
> > currently possible without something additional (e.g. a generator
> > function).
> 
> All of these are trivial. Start with the grouper recipe from the itertools 
> documentation, which is your case 2) above, renaming if desired:
> 
> http://docs.python.org/2/library/itertools.html#recipes
> 
> def chunk_pad(n, iterable, fillvalue=None):
>     args = [iter(iterable)] * n
>     return izip_longest(fillvalue=fillvalue, *args)
> 
> Now define:
> 
> def chunk_short(n, iterable):  # Case 1) above
>     sentinel = object()
>     for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
>         if sentinel not in chunk:
>             yield chunk
>         else:
>             i = chunk.index(sentinel)
>             yield chunk[:i]
> 
> def chunk_strict(n, iterable):  # Case 3) above
>     sentinel = object()
>     for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
>         if sentinel in chunk:
>             raise ValueError
>         yield chunk
> 

These are only trivial on the surface. I brought up this topic on
python-ideas just weeks ago, and it turns out there is a surprising number of
alternative solutions that people use for these two cases. Yours is
straightforward and simple, but comes at the price of the "if sentinel" test
being run on every chunk. An optimized version suggested by Peter Otten
replaces your for loop with:

chunks = zip_longest(*args, fillvalue=fillvalue)
prev = next(chunks)

for chunk in chunks:
    yield prev
    prev = chunk

and then does the sentinel test only once, on the last chunk (prev), after
the loop has finished.
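
For reference, here is how I would spell out that lookahead idea as complete
generators. This is only a sketch under my own assumptions (Python 3's
zip_longest, n >= 1, an explicit guard for empty input, and an identity test
for the sentinel), not necessarily the exact code Peter posted:

```python
from itertools import zip_longest


def _padded_chunks(n, iterable, sentinel):
    # Same trick as the grouper recipe: n references to ONE iterator.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=sentinel)


def chunk_short(n, iterable):
    """Case 1: yield tuples of length n; the last one may be shorter."""
    sentinel = object()
    chunks = _padded_chunks(n, iterable, sentinel)
    try:
        prev = next(chunks)
    except StopIteration:
        return  # empty input: nothing to yield
    for chunk in chunks:
        yield prev  # anything followed by another chunk must be full
        prev = chunk
    # Padding can only occur at the end of the final chunk: test it once.
    if prev[-1] is sentinel:
        prev = tuple(x for x in prev if x is not sentinel)
    yield prev


def chunk_strict(n, iterable):
    """Case 3: like chunk_short, but raise if the last chunk is incomplete."""
    sentinel = object()
    chunks = _padded_chunks(n, iterable, sentinel)
    try:
        prev = next(chunks)
    except StopIteration:
        return
    for chunk in chunks:
        yield prev
        prev = chunk
    if prev[-1] is sentinel:
        raise ValueError("length of iterable is not a multiple of n")
    yield prev
```

Note that chunk_strict only raises after all the complete chunks have been
yielded, which is the same behavior as in your version.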

>> 1) Yield a shorter chunk
>> 2) Extend the chunk with fill values
>> 3) Raise an error
>> 4) Ignore the last chunk
>> 
>> Cases 2 and 4 can be achieved with current itertools primitives, e.g.:
>> 2) izip_longest(fillvalue=fillvalue, *[iter(iterable)] * n)
>> 4) zip(*[iter(iterable)] * n)

In my opinion, it would make sense to have the 4 cases suggested by Oscar
covered by itertools. As he says, cases 2 and 4 already are (and the grouper
recipe in the itertools docs gives the solution for case 2). Covering the
other two would prevent people from reinventing (often suboptimal) solutions
to these common problems, and it would bring a speed gain even compared to
the best pure-Python implementations, since things would be coded in C.

I would advocate for either of the following two solutions:

a) have an extra 'mode'-type argument for zip_longest() to control its
behavior (the default mode could be the current fillvalue padding, 'strict'
mode would raise an error, and 'relaxed' mode would yield the shorter chunk),
or
b) have extra zip_strict and zip_relaxed functions in itertools (I'm also not
too good at thinking up names :))

Either way, you could then very easily modify the existing grouper recipe in
itertools to implement the four different 'chunks' functions (I would keep
calling them 'grouper' functions in line with the current itertools version).
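
To make option b) concrete, here is a rough pure-Python sketch of what I have
in mind. zip_strict and zip_relaxed are hypothetical names that do not exist
in itertools, the zipper parameter of grouper is my own addition to the
recipe, and a real C implementation would not need to test every row the way
this one does:

```python
from itertools import zip_longest


def zip_strict(*iterables):
    """Hypothetical: like zip(), but raise if the lengths differ."""
    sentinel = object()
    for row in zip_longest(*iterables, fillvalue=sentinel):
        if any(x is sentinel for x in row):
            raise ValueError("iterables have different lengths")
        yield row


def zip_relaxed(*iterables):
    """Hypothetical: like zip_longest(), but drop missing values
    instead of padding, so late rows may be shorter."""
    sentinel = object()
    for row in zip_longest(*iterables, fillvalue=sentinel):
        yield tuple(x for x in row if x is not sentinel)


def grouper(n, iterable, zipper=zip_longest):
    """The itertools grouper recipe with a pluggable zip function."""
    args = [iter(iterable)] * n
    return zipper(*args)
```

With these, grouper(3, "abcdefg", zip_relaxed) ends with the short chunk
('g',), grouper(3, "abcdefg", zip_strict) raises ValueError once it reaches
the incomplete chunk, and plain zip or zip_longest as the zipper give cases
4 and 2.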

> I think the real reasons it's not in the standard library are:
> 
> - there's no consensus on what chunking should do;
> 
> - hence whatever gets added will disappoint some people;
> 
> - unless you add "all of them", in which case you've now got a 
> significantly harder API ("there are five chunk functions in itertools, 
> which should I use?");
>
> What does chunking (grouping) even mean? Given:
> 
> chunk("abcdef", 3)
> 
> should I get this?  [abc, def]
> 
> or this?  [abc, bcd, cde, def]

I guess this suggestion would not compromise the API too much. After all,
all these zip versions would still behave like a zip() function should.
Itertools users would also know what a 'grouper' recipe does, i.e., it
doesn't do the fancier alternative stuff you suggest, like rejoining groups
of characters obtained from a string. So this would be a relatively
conservative addition.

What do you think?
Wolfgang