Why chunks is not part of the python standard lib?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu May 2 01:15:38 EDT 2013


On Wed, 01 May 2013 10:00:04 +0100, Oscar Benjamin wrote:

> On 1 May 2013 08:10, Mark Lawrence <breamoreboy at yahoo.co.uk> wrote:
>> On 01/05/2013 07:26, Ricardo Azpeitia Pimentel wrote:
>>>
>>> After reading How do you split a list into evenly sized chunks in
>>> Python?
>>>
>>> <http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python>
>>>
>>> and seeing this kind of mistake happening
>>> https://code.djangoproject.com/ticket/18972 all the time.

That bug is irrelevant to the question about chunking a sequence or 
iterator.


>>> Why is there not a |chunks| function in itertools?
[...]
>> Asked and answered a trillion times.  There's no consensus on how
>> chunks should behave.
> 
> I'm not sure that's a valid argument against it since a chunks function
> could just do a different thing depending on the arguments given.

Yes, but that's a rubbish API. I'd rather have five separate functions. 
Or maybe five methods on a single Chunker object.

I think the real reasons it's not in the standard library are:

- there's no consensus on what chunking should do;

- hence whatever gets added will disappoint some people;

- unless you add "all of them", in which case you've now got a 
significantly harder API ("there are five chunk functions in itertools, 
which should I use?"); 

- and none of them are actually very hard to write.


So the prospect of adding chunks somewhere is unattractive: lots of angst 
for very little benefit. Yes, it could be done, but none of the Python 
developers, and especially not the itertools "owner", Raymond Hettinger, 
think that the added complication is worth the benefit.


> The issue is around how to deal with the last chunk if it isn't the same
> length as the others and I can only think of 4 reasonable responses:

That's not the only issue. What does chunking (grouping) even mean? Given:

chunk("abcdef", 3)

should I get this?  [abc, def]

or this?  [abc, bcd, cde, def]


There are good use-cases for both.
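
To make the two readings concrete, here is a minimal sketch using plain slicing. The names blocks and windows are made up for this illustration, and slicing only works on sequences, not on arbitrary iterators:

```python
def blocks(seq, n):
    # Non-overlapping pieces: step through the sequence n items at a time.
    return [seq[i:i + n] for i in range(0, len(seq), n)]


def windows(seq, n):
    # Overlapping pieces: slide a window of width n along one item at a time.
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


print(blocks("abcdef", 3))   # ['abc', 'def']
print(windows("abcdef", 3))  # ['abc', 'bcd', 'cde', 'def']
```

Same input, same n, two perfectly sensible answers.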

If given a string, should chunking re-join the individual characters into 
strings, or leave them as lists of chars? Tuples of chars? E.g.

chunk("abcdef", 3) => "abc" ...

or ["a", "b", "c"] ...

How about bytes?
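
For illustration only, this is what the usual zip trick hands back on Python 3 for a str and for a bytes object. The helper name grouped is hypothetical, and the re-joining step is just one of the possible choices, not a recommendation:

```python
def grouped(iterable, n):
    # n references to one iterator, so zip pulls n consecutive items per tuple.
    return list(zip(*[iter(iterable)] * n))


print(grouped("abcdef", 3))
# [('a', 'b', 'c'), ('d', 'e', 'f')]   tuples of chars, not strings

print(["".join(g) for g in grouped("abcdef", 3)])
# ['abc', 'def']                       re-joined by hand

print(grouped(b"abcdef", 3))
# [(97, 98, 99), (100, 101, 102)]      iterating bytes yields ints on Python 3

print([bytes(g) for g in grouped(b"abcdef", 3)])
# [b'abc', b'def']
```
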

I have opinions on these questions, but I'm not going to give them to 
you. The point is that chunking means different things to different 
people. If you write your own, you get to pick whatever behaviour you 
like, instead of trying to satisfy everyone.


> 1) Yield a shorter chunk
> 2) Extend the chunk with fill values
> 3) Raise an error
> 4) Ignore the last chunk
> 
> Cases 2 and 4 can be achieved with current itertools primitives e.g.:
> 2) izip_longest(fillvalue=fillvalue, *[iter(iterable)] * n)
> 4) zip(*[iter(iterable)] * n)
> 
> However I have only ever had use cases for 1 and 3 and these are not
> currently possible without something additional (e.g. a generator
> function).

All of these are trivial. Start with the grouper recipe from the itertools 
documentation, which is your case 2) above, renaming if desired:

http://docs.python.org/2/library/itertools.html#recipes


from itertools import izip, izip_longest  # Python 2 names

def chunk_pad(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n  # n references to the same iterator
    return izip_longest(fillvalue=fillvalue, *args)
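
In case the [iter(iterable)] * n idiom looks like magic, here is a small sketch of why it groups items. It is written for Python 3, where izip_longest is spelled zip_longest; the variable names are only for illustration:

```python
from itertools import zip_longest  # Python 3 spelling of izip_longest

it = iter("abcdefg")
args = [it] * 3  # a list holding the SAME iterator three times, not three copies

# Every slot of the list is the very same object.
assert all(a is it for a in args)

# zip_longest asks each argument in turn for its next item. Since all three
# arguments are one iterator, each output tuple takes 3 consecutive items.
result = list(zip_longest(*args, fillvalue="-"))
print(result)  # [('a', 'b', 'c'), ('d', 'e', 'f'), ('g', '-', '-')]
```
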


Now define:


def chunk_short(n, iterable):  # Case 1) above
    sentinel = object()
    for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
        if sentinel not in chunk:
            yield chunk
        else:
            i = chunk.index(sentinel)
            yield chunk[:i]


def chunk_strict(n, iterable):  # Case 3) above
    sentinel = object()
    for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
        if sentinel in chunk:
            raise ValueError("iterable length is not a multiple of n")
        yield chunk


def chunk(n, iterable):  # Case 4) above
    args = [iter(iterable)]*n
    return izip(*args)


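from itertools import islice

def chunk_other(n, iterable):  # I suck at thinking up names...
    # Sliding window: overlapping n-tuples, moving along one item at a time.
    it = iter(iterable)
    values = list(islice(it, n))
    if len(values) < n:
        return  # What if this is short? Here: yield nothing.
    while True:
        yield tuple(values)
        values.pop(0)
        try:
            values.append(next(it))
        except StopIteration:
            break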
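
As a rough check of the four last-chunk policies above, here is a sketch ported to Python 3 (zip_longest and zip in place of izip_longest and izip) and run on a seven-item input. The name chunk_drop is my stand-in for the case 4 function, chosen only to keep it apart from the loop variable:

```python
from itertools import zip_longest  # Python 3 spelling of izip_longest


def chunk_pad(n, iterable, fillvalue=None):  # case 2: pad the last chunk
    return zip_longest(*[iter(iterable)] * n, fillvalue=fillvalue)


def chunk_short(n, iterable):  # case 1: yield a shorter last chunk
    sentinel = object()
    for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
        if sentinel in chunk:
            chunk = chunk[:chunk.index(sentinel)]
        yield chunk


def chunk_strict(n, iterable):  # case 3: raise if items are left over
    sentinel = object()
    for chunk in chunk_pad(n, iterable, fillvalue=sentinel):
        if sentinel in chunk:
            raise ValueError("iterable length is not a multiple of n")
        yield chunk


def chunk_drop(n, iterable):  # case 4: silently ignore the last chunk
    return zip(*[iter(iterable)] * n)


data = range(7)
print(list(chunk_pad(3, data)))    # [(0, 1, 2), (3, 4, 5), (6, None, None)]
print(list(chunk_short(3, data)))  # [(0, 1, 2), (3, 4, 5), (6,)]
print(list(chunk_drop(3, data)))   # [(0, 1, 2), (3, 4, 5)]
try:
    list(chunk_strict(3, data))
except ValueError as err:
    print("chunk_strict raised:", err)
```

Note that chunk_strict is lazy: it hands over the complete chunks first and only raises when it reaches the incomplete one.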
        

-- 
Steven


