Today's fun and educational Python recipe

Raymond Hettinger python at rcn.com
Wed May 4 17:39:28 EDT 2011


On May 4, 12:42 pm, Terry Reedy <tjre... at udel.edu> wrote:
> On 5/4/2011 2:17 PM, Raymond Hettinger wrote:
>
> > Here's a 22-line beauty for a classic and amazing algorithm:
> >http://bit.ly/bloom_filter
>
> > The wiki article on the algorithm is brief and well-written:
> >http://en.wikipedia.org/wiki/Bloom_filter
>
> As I understand the article, the array of num_bits should have
> num_probes (more or less independent) bits set for each key. But as I
> understand the code
>
>          for i in range(self.num_probes):
>              h, array_index = divmod(h, num_words)
>              h, bit_index = divmod(h, 32)
>              yield array_index, 1 << bit_index
>
> the same bit is being set or tested num_probes times. The last three
> lines have no dependence on i that I can see, so they appear to do the
> same thing each time. This seems like a bug.

The 512 bits in h are progressively eaten-up between iterations.  So
each pass yields a different (array index, bit_mask) pair.

It's easy to use the interactive prompt to show that different probes
are produced on each pass:

>>> bf = BloomFilter(num_bits=1000, num_probes=8)
>>> pprint(list(bf.get_probes('Alabama')))
[(19, 1073741824),
 (11, 64),
 (9, 134217728),
 (25, 1024),
 (24, 33554432),
 (6, 16),
 (7, 16777216),
 (22, 1048576)]

The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of
a cryptographic hash ;)

The fifty state example in the recipe is a reasonable demonstration
that the recipe works as advertised.  It successfully finds all fifty
states (the true positives) and it tries 100,000 negatives resulting
in only a handful of false negatives.  That should be somewhat
convincing that it all works.


Raymond

-------
follow my other python tips and recipes on twitter:  @raymondh



More information about the Python-list mailing list