don't need dictionary's keys - hash table?

Thu Jul 13 06:57:36 EDT 2006

"Fredrik Lundh" <fredrik at pythonware.com> writes:
> btw, since my brain is on vacation, could anyone who already has all
> the necessary background information in their head (Paul?) tell me
> if using a chopped-up MD5 or SHA-1 (or whatever) digest as the hash
> functions for a bloom filter is a good idea...

Erm, my brain isn't on vacation, it's just plain missing ;-).  However
I'd say there's nothing wrong in principle with chopping up sha/md5
output if you want to get several hashes as you're doing, assuming the
total hash length is enough.  The advantage over doing separate hashes
for each piece is optimization: you don't compute as many hashes.
But there's more flexibility with something like (untested):

   def multi_hash(data, i, size):    # find i'th slice of hash
     return sha.new('%x,%s'% (i, data)).digest()[:size]

Also, the idea of a Bloom filter is to use less memory than a hash
table, by not storing the keys.  In your implementation you chop the
sha/md5 output into 4-byte pieces, which means about 4 billion "slots"
in each set, which might be too many for a typical computer.  Also,
you end up storing each of those 4-byte values as a Python string.  So
I think in practice you'd want smaller sets, and to represent them as
bit vectors instead of as Python sets (untested):

  import array
  class bit_vector(object):
     def __init__(self, size_in_bits):
        nbytes = (size_in_bits + 7) // 8
        self.vec = array.array('B', '\0' * nbytes)

     def __getattr__(self, i):  # get i'th bit from vector
        a,b = divmod(i, 8)
        return bool(self.vec[a] & (1 << b))

     def __setattr__(self, i, value):   # set i'th bit in vector
        a,b = divmod(i, 8)
        v = int(bool(value))            # the actual bit to set, 1 or 0
        self.vec[a] = (self.vec[a] & ~(1 << b)) | (v << b)

If we use a bit vector representation we want the hash value to be an
integer (untested):

   def bloom_hash(data, i, size_in_bits):
      # my favorite way to get an integer value from sha
      h = int(sha.new('%x,%s'% (i, data)).hexdigest(), 16)
      return h % size_in_bits

We'd then implement the Bloom filter (untested):

   class bloom(object):
      def __init__(self, nlayers, layersize):
         self.nlayers, self.layersize = nlayers, layersize
         self.filter = [bit_vector(layersize) for i in xrange(nlayers)]

      def add(self, data):
         for i in xrange(self.nlayers):
            self.filter[i][bloom_hash(data, i, self.layersize)] = True

      def __contains__(self, data):
         return False not in \ 
           (self.filter[i][bloom_hash(data, i, self.layersize)] \
             for i in xrange(self.nlayers))

Finally, I think to use Bloom filters effectively you have to choose
nlayers and layersize carefully, according to the number of keys you
expect to see, the amount of memory you have available, and/or the
probability of false matches that you're willing to accept.  The
standard references (whatever they are) about Bloom filters probably
have some suitable formulas for computing these things.