[New-bugs-announce] [issue10302] Add class-functions to hash many small objects with hashlib

Lukas Lueg report at bugs.python.org
Wed Nov 3 23:44:17 CET 2010


New submission from Lukas Lueg <lukas.lueg at gmail.com>:

The objects provided by hashlib mainly serve the purpose of computing hashes over strings of arbitrary size. The user creates a new object (e.g. hashlib.sha1()), calls .update() with chunks of data and finally calls .digest() or .hexdigest() to get the hash. For convenience, these steps can also be combined into a single expression (e.g. hashlib.sha1('foobar').hexdigest()).
While this approach covers basically all use-cases for hash functions, it is inefficient when computing hashes of many small strings (e.g. due to interpreter overhead) and leaves no room for performance improvements.
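For reference, the two existing usage styles described above can be sketched as follows. This is only an illustration of the current hashlib API, not part of the proposal; note that on Python 3 the input has to be bytes, so the literals carry a b prefix.

```python
import hashlib

# Incremental style: create the object, feed chunks, then read the digest.
h = hashlib.sha1()
h.update(b"foo")
h.update(b"bar")
incremental = h.hexdigest()

# One-expression style: pass the data straight to the constructor.
one_shot = hashlib.sha1(b"foobar").hexdigest()

# Both styles hash the same byte stream, so the digests are identical.
assert incremental == one_shot
```

Either way, every small string costs at least one object creation and two method calls at the Python level, which is the overhead referred to above.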

There are many cases where we need the hashes of numerous (small) objects, most or all of which are available in memory at the same time.

I therefore propose to extend the classes provided by hashlib with an additional function that takes an iterable object, computes the hash over the string representation of each member and returns the results. Given the aim of this interface, the function is a member of the class (not the instance) and therefore has no state bound to an instance. Memory requirements are to be anticipated and met by the programmer.

For example:

foo = ['my_database_key1', 'my_database_key2']
hashlib.sha1.compute(foo) 
>> ('\x00\x00', '\xff\xff')
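To make the intended semantics concrete, here is a minimal pure-Python sketch of what such a function might return. The name compute(), its signature and the UTF-8 encoding of non-bytes members are assumptions made for illustration; hashlib has no such function today, and a real implementation would live in C to get any speedup.

```python
import hashlib

def compute(hash_name, items, encoding="utf-8"):
    """Hypothetical stand-in for the proposed hashlib.<algo>.compute().

    Returns a tuple with one raw digest per member of `items`.
    """
    digests = []
    for item in items:
        # Assumption: non-bytes members are hashed via their string
        # representation, encoded as UTF-8.
        data = item if isinstance(item, bytes) else str(item).encode(encoding)
        digests.append(hashlib.new(hash_name, data).digest())
    return tuple(digests)

foo = ['my_database_key1', 'my_database_key2']
digests = compute('sha1', foo)
```

Each entry of the returned tuple is a full 20-byte SHA1 digest; the two-byte values in the example above are abbreviated placeholders.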


I consider this interface to hashlib particularly useful, as we can take advantage of vector-based implementations that compute multiple hashes in one pass (e.g. through SSE2). GCC has a vector extension that provides a *somewhat* standard way to write code that can be compiled to SSE2 or similar machine code. Examples of vector-based implementations of SHA1 and MD5 can be found at https://code.google.com/p/pyrit/issues/detail?id=207


Contingency plan: We compile code that iterates over OpenSSL's EVP functions if the compiler is not GCC or SSE2 is not available. The same approach can be used to cover hashlib objects for which we don't have an optimized implementation.

----------
components: Library (Lib)
messages: 120351
nosy: ebfe
priority: normal
severity: normal
status: open
title: Add class-functions to hash many small objects with hashlib
type: feature request
versions: Python 3.2, Python 3.3

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10302>
_______________________________________

