Implementing file reading in C/Python

Sat Jan 10 02:44:31 EST 2009

On Jan 9, 2:14 pm, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> On Fri, 09 Jan 2009 15:34:17 +0000, MRAB wrote:
> > Marc 'BlackJack' Rintsch wrote:
>
> >> def iter_max_values(blocks, block_count):
> >>     for i, block in enumerate(blocks):
> >>         histogram = defaultdict(int)
> >>         for byte in block:
> >>             histogram[byte] += 1
>
> >>         yield max((count, value)
> >>                   for value, count in histogram.iteritems())[1]
>
> > [snip]
> > Would it be faster if histogram was a list initialised to [0] * 256?
>
> Don't know.  Then for every byte in the 2 GiB we have to call `ord()`.
> Maybe the speedup from the list compensates for this, maybe not.
>
> I think that we have to do something with *every* byte of that really
> large file *at Python level* is the main problem here.  In C that's just
> some primitive numbers.  Python has all the object overhead.

struct's B format might help here: it unpacks bytes directly to ints, so
no per-byte ord() call is needed.  Also, struct.unpack_from could
probably be combined with mmap to avoid copying the input.  Not to
mention that CPython caches the small ints, so the byte values 0..255
are never allocated or deallocated.
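
A rough sketch of that idea follows. It is written for Python 3, where
mmap objects support the buffer protocol and the with statement, so it
is not a drop-in for the 2.x code quoted above. The `path` and
`block_size` parameters are placeholders chosen for illustration, and
the list-based histogram is MRAB's suggestion rather than the original
defaultdict. Note that unpack_from still builds one tuple of ints per
block; only the copy of the raw bytes is avoided. No timing claims are
made here.

```python
import mmap
import struct


def iter_max_values(path, block_size):
    """Yield the most frequent byte value of each block in the file.

    Sketch only: `path` and `block_size` are illustrative parameters.
    mmap raises ValueError for an empty file, which is not handled here.
    """
    with open(path, "rb") as f:
        # Map the whole file read-only; length 0 means "entire file".
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            size = len(data)
            for offset in range(0, size, block_size):
                # The last block may be shorter than block_size.
                n = min(block_size, size - offset)
                histogram = [0] * 256
                # "B" is unsigned char: each byte arrives as an int 0..255.
                for value in struct.unpack_from("%dB" % n, data, offset):
                    histogram[value] += 1
                # Same tie-breaking as the quoted code: highest count,
                # then highest byte value.
                yield max((count, value)
                          for value, count in enumerate(histogram))[1]
```

Called as `list(iter_max_values("some_file.bin", 4096))`, it returns one
byte value per block. Whether this beats the dict version would have to
be measured on the real 2 GiB input.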