[Python-ideas] itertools.chunks()

Oscar Benjamin oscar.j.benjamin at gmail.com
Mon Apr 8 13:57:06 CEST 2013


On 8 April 2013 06:31, Wolfgang Maier
<wolfgang.maier at biologie.uni-freiburg.de> wrote:
> Oscar Benjamin <oscar.j.benjamin at ...> writes:
>>
>> On 7 April 2013 10:37, Wolfgang Maier
>> <wolfgang.maier at ...> wrote:
>> >> Also I find myself often writing helper functions like these:
>> >>
>> >> def chunked(sequence, size):
>> >>     i = 0
>> >>     while True:
>> >>         j = i
>> >>         i += size
>> >>         chunk = sequence[j:i]
>> >>         if not chunk:
>> >>             return
>> >>         yield chunk
>> >
>> > This is just an alternative to the grouper recipe from the itertools
>> > documentation, except that grouper should be way faster and will also
>> > work with iterators.
>>
>> It's not quite the same as grouper, since it doesn't use fill values;
>> I've never found that I wanted fill values in this situation.
>>
>> Also I'm not sure why you think that grouper would be "way faster".
[snip]
>
> I didn't want to imply that slicing was faster/slower than iteration.
> Rather, I thought that this particular example would run slower than the
> grouper recipe because of the rest of the Python code (the assignment,
> increment, and test for False every time through the loop). I have not
> tried to time it, but all this should make things slower than grouper,
> which spends most of its time in C. For the special case of ndarrays
> your argument sounds convincing, though!

Fair enough. I was assuming that the chunk size is large, in which case
the time is dominated by creating each slice.
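
For reference, the grouper recipe from the itertools documentation goes
roughly like this (quoted from memory, so check the docs for the exact
wording):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks, padding the last one:
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

It spends most of its time in C, but it always pads the final group,
which is exactly the behaviour under discussion here.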

> Regarding the differences between this code and grouper, I am well aware of
> them. It was for that reason that I mentioned the earlier thread
> "zip_strict() or similar in itertools" again, where Peter Otten showed
> an elegant alternative.

Sorry, I hadn't read that thread, but I have now; I see that you raised
precisely this issue. For what it's worth, I agree that the fact that a
generator is needed here suggests that some kind of primitive is missing
from itertools.
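
This isn't Peter Otten's code, just my own sketch of the idiom
underlying all of these recipes: pass the same iterator to zip()
several times. The missing primitive shows up in what happens to a
short final chunk, since zip() drops it silently and zip_longest()
pads it, with nothing in between that can raise:

it = iter('qwertyuiop')
print(list(zip(*[it] * 3)))
# [('q', 'w', 'e'), ('r', 't', 'y'), ('u', 'i', 'o')]
# the trailing 'p' is silently dropped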

Also, here's a version of the same thing from my own code (modified a
little) that uses islice instead of zip_longest. I haven't timed it,
but it was intended to be fast for large chunk sizes, and I'd be
interested to know how it compares:


from itertools import islice

def chunked(iterable, size, **kwargs):
    '''Breaks an iterable into chunks

    Usage:
        >>> list(chunked('qwertyuiop', 3))
        [['q', 'w', 'e'], ['r', 't', 'y'], ['u', 'i', 'o'], ['p']]

        >>> list(chunked('qwertyuiop', 3, fillvalue=None))
        [['q', 'w', 'e'], ['r', 't', 'y'], ['u', 'i', 'o'], ['p', None, None]]

        >>> list(chunked('qwertyuiop', 3, strict=True))
        Traceback (most recent call last):
            ...
        ValueError: Invalid chunk size
    '''
    # Bind the builtins to local names: local lookups are faster
    # than globals in the loop below.
    list_, islice_ = list, islice
    iterator = iter(iterable)

    # Pull full-size chunks until islice comes up short; a short
    # chunk means the iterator is exhausted.
    chunk = list_(islice_(iterator, size))
    while len(chunk) == size:
        yield chunk
        chunk = list_(islice_(iterator, size))

    # Handle whatever is left over (possibly nothing).
    if not chunk:
        return
    elif kwargs.get('strict', False):
        raise ValueError('Invalid chunk size')
    elif 'fillvalue' in kwargs:
        yield chunk + (size - len(chunk)) * [kwargs['fillvalue']]
    else:
        yield chunk
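
If anyone wants to time this against grouper, a quick sketch along
these lines should do it (assuming chunked() above and the grouper
recipe are both defined in the same module; the data and sizes here
are arbitrary, so adjust to taste):

import timeit

# assumes chunked() and grouper() are defined in this module
setup = 'from __main__ import chunked, grouper; data = range(10**6)'
for stmt in ('list(chunked(data, 1000))', 'list(grouper(data, 1000))'):
    print(stmt, timeit.timeit(stmt, setup=setup, number=10))

Note that grouper yields tuples where chunked yields lists, so the
comparison isn't perfectly like-for-like.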


Oscar


