Refactor a buffered class...

Thu Sep 7 15:52:16 EDT 2006

George Sakkis wrote:
> Michael Spencer wrote:
>> George Sakkis wrote:
>>> Michael Spencer wrote:
>>>
>>>> def chunker(s, chunk_size=3, sentry=".", keep_first = False, keep_last = False):
>>>>      buffer=[]
>> ...
>>> And here's a (probably) more efficient version, using a deque as a
>>> buffer:
>>>
>> Perhaps the deque-based solution is more efficient under some conditions, but
>> it's significantly slower for all the cases I tested:
> 

> As it turns out, none of chunk_size,
> words_per_group and word_length are taken into account in your tests;
> they all have their default values.

Hello George

Yep, you're right, the test was broken.  chunkerGS beats chunkerMS handily in
some cases, particularly for large chunk_size.

> Second, the output of the two functions is different, so you're not
> comparing apples to apples:

And, to be fair, neither meets the OP's spec for joined output.

> Third, and most important for the measured difference, is that the
> performance hit in my function came from joining the words of each
> group (['check', 'if', 'it'] -> 'check if it') every time it is
> yielded. If the groups are left unjoined as in Michael's version, the
> results are quite different:

Second and Third are basically the same point, i.e., join dominates the
comparison. But your function *needs* an extra join to get the OP's specified
output.
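
To make the join point concrete, here is a minimal sketch. The sample words are
made up for illustration and are not the OP's data. A flat buffer that still
contains the sentries needs a single join, while per-sentence chunks need one
inner join per chunk plus an outer join:

```python
# Illustrative data only; not the OP's test case.
words = ["check", "if", "it", ".", "works", "now", "."]
chunks = [["check", "if", "it"], ["works", "now"]]

# Flat buffer (sentries kept as items): one join over all the words.
flat = " ".join(words)

# Per-sentence chunks (sentries stripped): an inner join for each chunk,
# then an outer join that puts the sentries back.
nested = " . ".join(" ".join(chunk) for chunk in chunks) + " ."

print(flat == nested)
```

Both spellings build the same string; the nested form simply does more work on
every yield.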

I think the two versions below each give the 'correct' output wrt the OP's
single test case.  I measure chunkerMS2 to be faster than chunkerGS2 across all
chunk sizes, but this is all about the joins.

I conclude that chunkerGS's deque beats chunkerMS's list for large chunk_size
(roughly > 100).  But for joined output, chunkerMS2 beats chunkerGS2 because it
does less joining.
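
The likely reason the deque wins for large windows is the cost of dropping the
oldest entry: deleting from the front of a list shifts everything left, while
deque.popleft() does not. A simplified sketch, assuming fixed-size windows of
single items rather than sentences (slide_list and slide_deque are hypothetical
names, and this is not a benchmark):

```python
from collections import deque

def slide_list(items, window):
    """Sliding windows using a list as the buffer."""
    buf, out = [], []
    for x in items:
        buf.append(x)
        if len(buf) == window:
            out.append(tuple(buf))
            del buf[:1]        # shifts the remaining items: O(len(buf))
    return out

def slide_deque(items, window):
    """Sliding windows using a deque as the buffer."""
    buf, out = deque(), []
    for x in items:
        buf.append(x)
        if len(buf) == window:
            out.append(tuple(buf))
            buf.popleft()      # constant time, independent of window size
    return out

print(slide_list(range(5), 3) == slide_deque(range(5), 3))
```

The two give identical windows; only the cost of advancing the buffer differs,
which matters once the window gets large.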

> if you're going to profile something, better use the
> standard timeit module
...
OT: I will when timeit grows a capability for testing live objects rather than 
'small code snippets'.  Requiring source code input and passing arguments by 
string substitution makes it too painful for interactive work.  The need to 
specify the number of repeats is an additional annoyance.
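
For interactive work, something along these lines is enough. This is only a
sketch; time_call and consume are hypothetical helper names, not part of timeit
or any other standard module:

```python
import time

def time_call(func, *args, **kwargs):
    """Return (elapsed_seconds, result) for a single call on live objects."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return elapsed, result

def consume(gen_func, *args, **kwargs):
    """Exhaust a generator function so its whole body runs; return item count."""
    return sum(1 for _ in gen_func(*args, **kwargs))
```

A call such as time_call(consume, chunkerMS2, words, chunk_size=100), where
words is whatever sequence is already at hand, times one full pass. A single
run is noisy, so repeat it a few times before believing the numbers.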

Cheers

Michael

#Revised functions with joined output, per OP spec

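def chunkerMS2(s, chunk_size=3, sentry=".", keep_first = False, keep_last = False):
     # Yield overlapping windows of chunk_size sentences as joined strings
     buffer=[]
     sentry_count = 0

     for item in s:
         buffer.append(item)
         if item == sentry:
             sentry_count += 1
             if sentry_count < chunk_size:
                 # Window not yet full: emit partial windows only on request
                 if keep_first:
                     yield " ".join(buffer)
             else:
                 yield " ".join(buffer)
                 # Slide the window: drop the oldest sentence and its sentry
                 del buffer[:buffer.index(sentry)+1]

     if keep_last:
         # Drain what is left as shrinking windows
         # (assumes the input ends with a sentry)
         while buffer:
             yield " ".join(buffer)
             del buffer[:buffer.index(sentry)+1]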

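from collections import deque
from itertools import islice

# itersplit: helper from earlier in the thread, not repeated here

def chunkerGS2(seq, sentry='.', chunk_size=3, keep_first=False,
               keep_last=False):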
     def format_chunks(chunks):
         return " . ".join(' '.join(chunk) for chunk in chunks) + " ."
     iterchunks = itersplit(seq,sentry)
     buf = deque()
     for chunk in islice(iterchunks, chunk_size-1):
         buf.append(chunk)
         if keep_first:
             yield format_chunks(buf)
     for chunk in iterchunks:
         buf.append(chunk)
         yield format_chunks(buf)
         buf.popleft()
     if keep_last:
         while buf:
             yield format_chunks(buf)
             buf.popleft()
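
chunkerGS2 relies on itersplit, which is not shown in this message. The
following is only a guess at a compatible helper, not George's actual code: it
assumes itersplit yields the items between sentries as lists, with the sentries
themselves dropped.

```python
def itersplit(seq, sentry="."):
    """Yield lists of items between sentries; the sentries are dropped.

    Assumed behaviour, reconstructed from how chunkerGS2 uses the result.
    """
    buf = []
    for item in seq:
        if item == sentry:
            yield buf
            buf = []
        else:
            buf.append(item)
    if buf:
        # Trailing items with no closing sentry
        yield buf

print(list(itersplit(["check", "if", "it", ".", "works", "."])))
```

With a helper of this shape, format_chunks receives word lists and puts the
sentries back when joining.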