[Spambayes] Re: Collecting word lists.. - BUMMER
T. Alexander Popiel
popiel at wolfskeep.com
Sun May 25 14:44:30 EDT 2003
In message: <3ED0EBA5.15635.1DF18942 at localhost>
"Brad Clements" <bkc at murkworks.com> writes:
>Ok, everyone say "I told you so".
"I told you so", even though I didn't expect this sort of result.
(I thought it would be pointless, not impossible.)
>Something seems amiss with my analysis, I just find it hard to believe
>that users have so few words in common.
I also think this is extremely peculiar.
>Either my code is bad, the sha collection process is flawed or there
>really isn't much in common, but I didn't expect it to be this bad.
I don't see anything wrong with your analysis code, so I'm beginning
to suspect the collection process is flawed. I'm perfectly willing to
make my non-SHA'd wordlist available for verification. If there are more
than 71131 words in common between my wordlist and your wordlist, then
we _know_ there's something wrong with the collection process.
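A rough sketch of that check, in Python. The hashing scheme here (SHA-1 hex digest of each token's UTF-8 bytes) is only my assumption about what the collector does, and sha_word and common_count are made-up names; swap in whatever the real collection script uses.

```python
import hashlib

def sha_word(word):
    # Assumption: each token is stored as the SHA-1 hex digest of its
    # UTF-8 bytes. The real collector may hash differently.
    return hashlib.sha1(word.encode("utf-8")).hexdigest()

def common_count(plain_words, hashed_words):
    """Count distinct plain words whose hash appears in the hashed list."""
    hashed = set(hashed_words)
    return sum(1 for w in set(plain_words) if sha_word(w) in hashed)

# Toy data standing in for a plain wordlist and a SHA'd wordlist.
mine = ["hello", "world", "spam"]
theirs = [sha_word(w) for w in ["hello", "spam", "ham"]]
print(common_count(mine, theirs))
```

If the count from the plain list comes out noticeably higher than what the SHA-to-SHA comparison reports for the same pair, the hashing step is the likely culprit.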
>First comparing personal corpuses that all claim to be english.
>
>Loaded 7 wordlists with 947127 distinct words out of 1135946 total words
>Word counts by number of collections each word is seen in
>Col.   # Words   % of unique words
>1       823450   86.9%
>2        71131    7.5%
>3        40919    4.3%
>4        10690    1.1%
>5          905    0.1%
>6           32    0.0%
>7            0    0.0%
Now this is just too strange to believe... for English mail, there
should be at least 5000 or so words in common among everybody; that's
about the size of the everyday English vocabulary. If anything, this
number should be bloated a bit by aliasing due to punctuation (the same
word showing up as several distinct tokens).
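For reference, a minimal sketch of how a table like the one quoted above could be recomputed from plain wordlists. This is not Brad's analysis code; collection_histogram is a hypothetical name and the three tiny lists are toy data.

```python
from collections import Counter

def collection_histogram(wordlists):
    """Map 'number of collections a word is seen in' to 'number of words'."""
    seen_in = Counter()
    for words in wordlists:
        # set() so a word counts at most once per collection
        seen_in.update(set(words))
    return Counter(seen_in.values())

# Toy data: three collections standing in for per-user wordlists.
lists = [["the", "and", "spam"], ["the", "and", "ham"], ["the", "eggs"]]
hist = collection_histogram(lists)
total = sum(hist.values())
for n in sorted(hist):
    print(n, hist[n], "%.1f%%" % (100.0 * hist[n] / total))
```

With real English mail I would expect the last row (words seen in every collection) to be well above zero.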
- Alex