Implementing file reading in C/Python
Francesco Bochicchio
bockman at virgilio.it
Sat Jan 10 12:39:55 EST 2009
On Fri, 09 Jan 2009 15:34:17 +0000, MRAB wrote:
> Marc 'BlackJack' Rintsch wrote:
>> On Fri, 09 Jan 2009 04:04:41 +0100, Johannes Bauer wrote:
>>
>>> As this was horribly slow (20 Minutes for a 2GB file) I coded the whole
>>> thing in C also:
>>
>> Yours took ~37 minutes for 2 GiB here. This "just" ~15 minutes:
>>
>> #!/usr/bin/env python
>> from __future__ import division, with_statement
>> import os
>> import sys
>> from collections import defaultdict
>> from functools import partial
>> from itertools import imap
>>
>>
>> def iter_max_values(blocks, block_count):
>>     for i, block in enumerate(blocks):
>>         histogram = defaultdict(int)
>>         for byte in block:
>>             histogram[byte] += 1
>>
>>         yield max((count, value)
>>                   for value, count in histogram.iteritems())[1]
>>
> [snip]
> Would it be faster if histogram was a list initialised to [0] * 256?
I tried it on my computer, also getting character codes with
struct.unpack, like this:

    histogram = [0] * 256
    for byte in struct.unpack('%dB' % len(block), block):
        histogram[byte] += 1
    yield max((count, idx)
              for idx, count in enumerate(histogram))[1]

and I also removed the map( ord ... ) statement in the main program, since
iter_max_values now returns character codes directly.
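
For anyone who wants to try the list-based variant on its own, here is a minimal self-contained sketch. It assumes Python 3 rather than the 2.6 used above, and the helper name most_common_byte is mine, not from the original program. Note that in Python 3 iterating over a bytes object already yields integers, so struct.unpack is kept here only to mirror the snippet above.

```python
import struct


def most_common_byte(block):
    """Return the byte value (0-255) that occurs most often in block.

    Hypothetical helper name; on a tie the larger byte value wins,
    because max() compares (count, value) tuples.
    """
    histogram = [0] * 256
    # '%dB' % n builds a format such as '6B': n unsigned bytes -> n ints.
    for byte in struct.unpack('%dB' % len(block), block):
        histogram[byte] += 1
    return max((count, value) for value, count in enumerate(histogram))[1]


def iter_max_values(blocks):
    """Yield the most frequent byte value of each block, as an int."""
    for block in blocks:
        yield most_common_byte(block)
```

Since the generator yields integers, the caller no longer has to apply ord to the results.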
The result is 10 minutes, against 13 for BlackJack's original code,
on my PC (Intel iMac, Python 2.6.1).
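
A rough way to repeat this kind of comparison on a small block instead of a 2 GB file is sketched below. It assumes Python 3, and the function names and the compare helper are hypothetical, introduced only for this illustration. Timings will of course differ by machine and Python version.

```python
import os
import struct
import timeit
from collections import defaultdict


def max_byte_dict(block):
    # Dictionary-based histogram, in the style of the quoted version.
    histogram = defaultdict(int)
    for byte in block:
        histogram[byte] += 1
    return max((count, value) for value, count in histogram.items())[1]


def max_byte_list(block):
    # List-based histogram indexed directly by byte value.
    histogram = [0] * 256
    for byte in struct.unpack('%dB' % len(block), block):
        histogram[byte] += 1
    return max((count, value) for value, count in enumerate(histogram))[1]


def compare(block, number=5):
    # Hypothetical helper: total seconds for `number` runs of each variant.
    return {
        func.__name__: timeit.timeit(lambda: func(block), number=number)
        for func in (max_byte_dict, max_byte_list)
    }
```

Calling compare(os.urandom(1 << 20)) returns a dictionary with one timing per variant for a 1 MiB random block.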
Strangely, using histogram = array.array('i', [0] * 256) takes 13
minutes again, even if I create the array outside the loop and then use
histogram[:] = zero_array to reset the values.
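
To make the array variant concrete, this is roughly what I mean, written as a sketch for Python 3. The names ZERO_ARRAY and iter_max_values_array are made up for the example. The array is allocated once and reset with a slice assignment before each block.

```python
import struct
from array import array

# Template of 256 zeroed counters, created once and reused for every reset.
ZERO_ARRAY = array('i', [0] * 256)


def iter_max_values_array(blocks):
    """Yield the most frequent byte value of each block, using array.array."""
    histogram = array('i', [0] * 256)  # allocated once, outside the loop
    for block in blocks:
        # Slice assignment overwrites the counters in place.
        histogram[:] = ZERO_ARRAY
        for byte in struct.unpack('%dB' % len(block), block):
            histogram[byte] += 1
        yield max((count, value)
                  for value, count in enumerate(histogram))[1]
```

The only difference from the list version is the container type, so any timing gap between the two comes from how array.array handles item reads and writes.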
Ciao
-----
FB