Fastest standard way to sum bytes (and their squares)?

Peter Otten __peter__ at web.de
Sun Aug 12 06:00:14 EDT 2007


Erik Max Francis wrote:

> For a file hashing system (finding similar files, rather than identical
> ones), I need to be able to efficiently and quickly sum the ordinals of
> the bytes of a file and their squares.  Because of the nature of the
> application, it's a requirement that I do it in Python, or only with
> standard library modules (if such facilities exist) that might assist.
> 
> So far the fastest way I've found is using the `sum` builtin and
> generators::
> 
> ordinalSum = sum(ord(x) for x in data)
> ordinalSumSquared = sum(ord(x)**2 for x in data)
> 
> This is about twice as fast as an explicit loop, but since it's going to
> be processing massive amounts of data, the faster the better.  Are there
> any tricks I'm not thinking of, or perhaps helper functions in other
> modules that I'm not thinking of?
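
For reference, I assume the explicit loop you benchmarked against looks
roughly like this (my reconstruction, so the name summit_loop and the
details are only guesses):

def summit_loop(data):
    # plain Python-level loop, for comparison with the generator version
    total = total_sq = 0
    for x in data:
        o = ord(x)
        total += o
        total_sq += o**2
    return total, total_sq

The generator expressions win over this mostly because sum() then does the
accumulation in C instead of interpreting += on every iteration.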

Two ideas:

Use a lookup-table for ord(c)**2
Use array.array()

$ cat summit.py
import array

data = "dere gewizzede bizzede bizz" * 1000 + chr(255)

# precomputed squares for every possible byte value
lookup = dict((i, i**2) for i in range(256))

def summit_str(data=data):
    # baseline: generator expressions over the characters of the string
    return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
    # unsigned bytes: sum() adds the ints directly, no ord() calls needed
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(lookup[x] for x in a)

if __name__ == "__main__":
    assert summit_array() == summit_str()

$ python -m timeit -s'from summit import summit_str as summit' 'summit()'
10 loops, best of 3: 32.2 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 13.4 msec per loop

Your actual code may be even faster because you can read the bytes directly
from the file.
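
Untested sketch of what I mean (summit_file is just a name I made up here,
same Python 2 idioms as summit.py above):

import array, os

lookup = dict((i, i**2) for i in range(256))

def summit_file(path, lookup=lookup):
    # fromfile() reads the raw bytes straight into the array,
    # skipping the intermediate string that fromstring() needs
    a = array.array("B")
    f = open(path, "rb")
    try:
        a.fromfile(f, os.path.getsize(path))
    finally:
        f.close()
    return sum(a), sum(lookup[x] for x in a)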

Peter


