Implementing file reading in C/Python

Sat Jan 10 12:39:55 EST 2009

On Fri, 09 Jan 2009 15:34:17 +0000, MRAB wrote:

> Marc 'BlackJack' Rintsch wrote:
>> On Fri, 09 Jan 2009 04:04:41 +0100, Johannes Bauer wrote:
>> 
>>> As this was horribly slow (20 Minutes for a 2GB file) I coded the whole
>>> thing in C also:
>> 
>> Yours took ~37 minutes for 2 GiB here.  This "just" ~15 minutes:
>> 
>> #!/usr/bin/env python
>> from __future__ import division, with_statement
>> import os
>> import sys
>> from collections import defaultdict
>> from functools import partial
>> from itertools import imap
>> 
>> 
>> def iter_max_values(blocks, block_count):
>>     for i, block in enumerate(blocks):
>>         histogram = defaultdict(int)
>>         for byte in block:
>>             histogram[byte] += 1
>>         
>>         yield max((count, byte)
>>                   for byte, count in histogram.iteritems())[1]
>>         
> [snip]
> Would it be faster if histogram was a list initialised to [0] * 256?

I tried it on my computer, also getting character codes with
struct.unpack, like this:

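        histogram = [0,]*256

        # needs "import struct" at the top of the script
        for byte in struct.unpack( '%dB'%len(block), block ):
            histogram[byte] += 1

        # pair each count with its own index (the byte code)
        yield max(( count, idx )
                  for idx, count in enumerate(histogram))[1]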

and I also removed the map( ord ... ) call in the main program, since
iter_max_values now returns character codes directly.

The result is 10 minutes, against 13 for BlackJack's original code, on
my machine (Intel iMac, Python 2.6.1).
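
For anyone who wants to try the list version on its own, here is a minimal
self-contained sketch. The helper name most_common_byte is made up for
this example, and the generator takes only the blocks (no block_count),
so it is not a drop-in replacement for the original function. It is
written so that it should also run on Python 3, where the blocks are
bytes objects.

```python
import struct


def most_common_byte(block):
    """Return the code (0-255) of the most frequent byte in block.

    Hypothetical helper for illustration; not part of the original script.
    """
    histogram = [0] * 256
    # '%dB' unpacks len(block) unsigned bytes into a tuple of ints.
    for byte in struct.unpack('%dB' % len(block), block):
        histogram[byte] += 1
    # Pair each count with its byte code; on a tie the higher code wins.
    return max((count, code) for code, count in enumerate(histogram))[1]


def iter_max_values(blocks):
    """Yield the most frequent byte code of each block."""
    for block in blocks:
        yield most_common_byte(block)
```

Since the codes come back as plain integers, the caller does not need
ord() any more.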

Strangely, using histogram = array.array( 'i',  [0,]*256 ) again gives 13
minutes, even if I create the array outside the loop and then use
  histogram[:] = zero_array
to reset the values.
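
The array variant, sketched the same way, looks roughly like this. The
names ZERO_ARRAY and iter_max_values_array are only for this example. A
plausible explanation for the lack of speedup (I have not profiled it) is
that every histogram[byte] += 1 on an array.array has to convert between
a C int and a Python int object, so per-item access is no cheaper than
with a list.

```python
import array
import struct

# Template of 256 zeroed counters, used to reset the histogram.
ZERO_ARRAY = array.array('i', [0] * 256)


def iter_max_values_array(blocks):
    """Yield the most frequent byte code of each block (array.array version)."""
    # Created once, outside the loop, and reused for every block.
    histogram = array.array('i', [0] * 256)
    for block in blocks:
        # Slice assignment overwrites all 256 counters in place.
        histogram[:] = ZERO_ARRAY
        for byte in struct.unpack('%dB' % len(block), block):
            histogram[byte] += 1
        yield max((count, code) for code, count in enumerate(histogram))[1]
```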

Ciao
-----
FB