[Spambayes] Re: Collecting word lists.. - BUMMER

T. Alexander Popiel popiel at wolfskeep.com
Sun May 25 14:44:30 EDT 2003


In message:  <3ED0EBA5.15635.1DF18942 at localhost>
             "Brad Clements" <bkc at murkworks.com> writes:

>Ok, everyone say "I told you so".

"I told you so", even though I didn't expect this sort of result.
(I thought it would be pointless, not impossible.)

>Something seems amiss with my analysis, I just find it hard to believe
>that users have so few words in common.

I also think this is extremely peculiar.

>Either my code is bad, the sha collection process is flawed or there
>really isn't much in common, but I didn't expect it to be this bad.

I don't see anything wrong with your analysis code, so I'm beginning
to suspect the collection process is flawed.  I'm perfectly willing to
make my non-SHA'd wordlist available for verification.  If there are
more than 71131 words in common between my wordlist and your wordlist,
then we _know_ there's something wrong with the collection process.
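
A minimal sketch of that verification, assuming the collection process
stores the SHA-1 hex digest of each word's UTF-8 bytes.  The actual
hashing scheme isn't stated in this thread, and the function names
below are hypothetical:

```python
import hashlib


def sha_wordlist(words):
    """Hash each word the way the collector is ASSUMED to hash it:
    SHA-1 over the UTF-8 bytes, stored as a lowercase hex digest."""
    return {hashlib.sha1(w.encode("utf-8")).hexdigest() for w in words}


def common_count(plain_words, hashed_words):
    """Count how many plaintext words also appear in a SHA'd wordlist."""
    return len(sha_wordlist(plain_words) & set(hashed_words))


if __name__ == "__main__":
    # Toy data only; real wordlists would be loaded from files.
    mine = ["hello", "world", "spam"]
    theirs = sha_wordlist(["spam", "world", "eggs"])
    print(common_count(mine, theirs))
```

If the count from a check like this disagrees with what the collected
SHA lists report for the same pair, the hashing or tokenizing step is
the first place to look.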

>First comparing personal corpuses that all claim to be english.
>
>Loaded 7 wordlists with 947127 distinct words out of 1135946 total words
>Word counts by number of collections each word is seen in
>Col.  # Words % of unique words
>1     823450 86.9%
>2      71131 7.5%
>3      40919 4.3%
>4      10690 1.1%
>5        905 0.1%
>6         32 0.0%
>7          0 0.0%

Now this is just too strange to believe... for English mail, there
should be at least about 5000 words in common among everybody; that's
roughly the size of the everyday English vocabulary.  Yet the table
shows no word at all appearing in all 7 lists.  If anything, the number
in common should be bloated a bit by aliasing due to punctuation.
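
For anyone who wants to cross-check the tally quoted above, here is a
rough sketch of the same kind of count.  It is not Brad's code; it
assumes each wordlist is simply an iterable of tokens (or of SHA
digests), and the function names are made up for illustration:

```python
from collections import Counter


def collection_histogram(wordlists):
    """Return (distinct_words, hist) where hist[n] is the number of
    words that appear in exactly n of the given wordlists."""
    seen_in = Counter()
    for wordlist in wordlists:
        # set() so a word repeated within one list counts only once
        seen_in.update(set(wordlist))
    return len(seen_in), Counter(seen_in.values())


def report(wordlists):
    """Format the histogram as 'collections, words, percent' rows."""
    distinct, hist = collection_histogram(wordlists)
    rows = []
    for n in range(1, len(wordlists) + 1):
        count = hist.get(n, 0)
        pct = 100.0 * count / distinct if distinct else 0.0
        rows.append("%d %7d %.1f%%" % (n, count, pct))
    return rows


if __name__ == "__main__":
    # Toy data only; not the lists from this thread.
    lists = [["a", "b", "c"], ["b", "c", "d"], ["c", "e"]]
    for row in report(lists):
        print(row)
```

Running the same tally over the plaintext lists and over the SHA'd
lists should give identical rows if the hashing step is sound.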

- Alex


