[spambayes-dev] RE: [Spambayes] How low can you go?

Seth Goodman nobody at spamcop.net
Thu Dec 18 20:25:16 EST 2003


> [Seth Goodman]
> > All your arguments on this point make lots of sense.
> > I'm a little surprised that you had significant
> > collisions mapping perhaps 100K items (my guess)
> > into a 32-bit space.  I think that is rather dependent
> > on the hash used, but that's what you saw.
>
> [Ryan Malayter]
> That's not surprising at all to me. Because of the "birthday paradox",
> even very input-sensitive (random-looking) hash functions like the
> 160-bit SHA-1 only give 80 bits of collision resistance. With a 32 bit
> perfect hash, you get just 16 bits of collision resistance. That means
> there is a 50% chance of a collision if you hash just 65,536 items. Hash
> more items than that, and your chances of collision go up further.
>
> If your hash function isn't perfectly (randomly) distributed in the
> 32-bit space, things could be much worse with 100,000 hashes in a
> collection.

As I understand it, the birthday paradox leads to the conclusion that for a
32-bit perfect hash function, after hashing around 78,000 items (just over 16
bits' worth), the odds are roughly even that you will see a _single_
collision.  What Tim described sounded like they probably had multiple
collisions to account for the spectacular failures they saw.  I don't know
the size of the token databases they dealt with back then, but I doubt a
single collision in a token list of 78K items would affect the classifier.
Since most of the tokens are hapaxes anyway (perhaps 80-90%?), it is most
probable that there would be no visible effect.
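
Just to put numbers on the 78K figure, here is a quick back-of-the-envelope
sketch using the standard birthday-bound approximation (my own illustration,
names and all; nothing from the SpamBayes code):

    import math

    def collision_probability(n, bits=32):
        # standard approximation: P(>= 1 collision) ~ 1 - exp(-n*(n-1)/(2*d))
        d = 2.0 ** bits
        return 1.0 - math.exp(-n * (n - 1) / (2.0 * d))

    def items_for_even_odds(bits=32):
        # number of items at which the collision probability reaches ~50%
        return math.sqrt(2.0 * (2.0 ** bits) * math.log(2))

    print(items_for_even_odds(32))            # ~77,163 items
    print(collision_probability(65536, 32))   # ~0.39
    print(collision_probability(100000, 32))  # ~0.69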

You are of course correct that going over the 78K-item limit would give more
collisions, but it would take quite a few collisions for one of the
colliding tokens to be something other than a hapax.  I am guessing that
unless there were a lot more than 100K tokens, the 32-bit hash function used
probably didn't do as good a randomizing job as needed.

Since they ultimately had to construct a map of hash_value <-> token_string,
they could have detected collisions (by checking the token already stored
under that hash value) and done something about it (e.g. used the next empty
bucket).  Since this would be a rare event, it wouldn't have cost much.
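
To illustrate what I mean, something like the following toy sketch would do
(hash32 and id_for_token are names I just made up, and this is nothing like
the actual SpamBayes storage code):

    MASK = 0xFFFFFFFF   # keep hash values in a 32-bit space

    def hash32(token):
        # stand-in for whatever 32-bit hash the real store would use
        return hash(token) & MASK

    def id_for_token(table, token):
        # table maps 32-bit hash value -> token_string; on a collision,
        # probe the next bucket until we hit our token or an empty slot
        h = hash32(token)
        while True:
            stored = table.get(h)
            if stored is None:        # empty bucket: claim it
                table[h] = token
                return h
            if stored == token:       # token already stored: reuse its id
                return h
            h = (h + 1) & MASK        # collision: try the next bucket

    table = {}
    print(id_for_token(table, "viagra") == id_for_token(table, "viagra"))  # True
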
number with "wrap-around") sounds much simpler.  However, it is important
that the number has enough bits that previously allocated feature_ID's are
ready to be reused (their tokens expired) by the time the allocation number
"wraps around" to them.  This just means that the number should probably be
32-bits.  Assuming you generate 100K tokens per day, the wrap-around time
for a 32-bit number is 117 years.  For a 24-bit number and the same rate of
token production, the wrap-around time is 167 days (around 5.5 months).  I'd
go for the 32-bit number and not worry about pathological operating schemes
or new tokenizers.  Even at 1 million new tokens per day, the wrap-around
time for a 32-bit feature_ID is over 10 years.  Why hash when you can
sequentially allocate?  This was just a bad idea on my part.  And it won't
be the last one :)
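
Here is the arithmetic behind those numbers, plus a toy version of the
sequential allocator (again my own sketch, assuming a fixed rate of token
production; not a proposed implementation):

    def wraparound_days(bits, tokens_per_day):
        # days before a sequentially allocated id of `bits` bits repeats
        return (2.0 ** bits) / tokens_per_day

    print(wraparound_days(32, 100000))     # ~42,950 days, about 117 years
    print(wraparound_days(24, 100000))     # ~168 days, about 5.5 months
    print(wraparound_days(32, 1000000))    # ~4,295 days, about 11.8 years

    class FeatureIDAllocator:
        # hand out feature_IDs in order; by the time the counter wraps
        # around, the old ids (and their tokens) are assumed to have expired
        def __init__(self, bits=32):
            self.mask = (1 << bits) - 1
            self.next_id = 0
            self.by_token = {}

        def get(self, token):
            if token not in self.by_token:
                self.by_token[token] = self.next_id
                self.next_id = (self.next_id + 1) & self.mask
            return self.by_token[token]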

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



