Fatest standard way to sum bytes (and their squares)?

Peter Otten __peter__ at web.de
Sun Aug 12 06:04:47 EDT 2007


Peter Otten wrote:

> Erik Max Francis wrote:
> 
>> For a file hashing system (finding similar files, rather than identical
>> ones), I need to be able to efficiently and quickly sum the ordinals of
>> the bytes of a file and their squares.  Because of the nature of the
>> application, it's a requirement that I do it in Python, or only with
>> standard library modules (if such facilities exist) that might assist.
>> 
>> So far the fastest way I've found is using the `sum` builtin and
>> generators::
>> 
>> ordinalSum = sum(ord(x) for x in data)
>> ordinalSumSquared = sum(ord(x)**2 for x in data)
>> 
>> This is about twice as fast as an explicit loop, but since it's going to
>> be processing massive amounts of data, the faster the better.  Are there
>> any tricks I'm not thinking of, or perhaps helper functions in other
>> modules that I'm not thinking of?
> 
> Two ideas:
> 
> Use a lookup-table for ord(c)**2
> Use array.array()
> 
> $ cat summit.py
> import array
> 
> data = "dere gewizzede bizzede bizz" * 1000 + chr(255)
> 
> lookup  = dict((i, i**2) for i in range(256))

Oops, make that a list:

lookup  = [i**2 for i in range(256)]

> def summit_str(data=data):
>     return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)
> 
> def summit_array(data=data, lookup=lookup):
>     a = array.array("B")
>     a.fromstring(data)
>     return sum(a), sum(lookup[x] for x in a)
> 
> if __name__ == "__main__":
>     assert summit_array() == summit_str()
> 
> $ python -m timeit -s'from summit import summit_str as summit' 'summit()'
> 10 loops, best of 3: 32.2 msec per loop
> $ python -m timeit -s'from summit import summit_array as summit'
> 'summit()' 100 loops, best of 3: 13.4 msec per loop

$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 12.9 msec per loop

Not a big gain, but still...

> Your actual code may be even faster because you can read the bytes
> directly from the file.
 
> Peter




More information about the Python-list mailing list