[Spambayes] Frequency distribution for wordinfo counts?

Tim Peters tim.one at comcast.net
Sun Feb 15 00:12:08 EST 2004


[Brad Clements]
> I'd like to get feedback from folks on the distribution of nham and
> nspam counts in their wordinfo databases.
>
> For example, I used sb_dbexpimp to dump my dbm based storage, then
> loaded it into excel and did a histogram on nham and nspam.
>
> ...
> Anyway, what I'm interested in is the number of tokens whose nspam
> or nham count is greater than 255 vs the total number of tokens and
> ham and spam count.

Here's mine today; I'm using bigrams:

nham 674 nspam 621

ham counts
value  #times   cumm      %   cumm%
     0 107422 107422  47.24  47.24
     1 103431 210853  45.49  92.73
     2   8304 219157   3.65  96.38
     3   2797 221954   1.23  97.61
     4   1508 223462   0.66  98.28
     5    852 224314   0.37  98.65
     6    513 224827   0.23  98.88
     7    359 225186   0.16  99.03
     8    259 225445   0.11  99.15
     9    203 225648   0.09  99.24
    10    192 225840   0.08  99.32
    11    117 225957   0.05  99.37
    12    139 226096   0.06  99.43
    13     90 226186   0.04  99.47
    14     90 226276   0.04  99.51
    15     91 226367   0.04  99.55
    16     77 226444   0.03  99.59
    17     66 226510   0.03  99.62
    18     63 226573   0.03  99.64
    19     52 226625   0.02  99.67
    20     30 226655   0.01  99.68
    21     34 226689   0.01  99.70
    22     39 226728   0.02  99.71
    23     31 226759   0.01  99.73
    24     30 226789   0.01  99.74
    25     25 226814   0.01  99.75
    26     16 226830   0.01  99.76
    27     20 226850   0.01  99.77
    28     20 226870   0.01  99.77
    29     22 226892   0.01  99.78
    30     18 226910   0.01  99.79
    31     14 226924   0.01  99.80
    32     16 226940   0.01  99.81
    33     16 226956   0.01  99.81
    34     13 226969   0.01  99.82
    35     12 226981   0.01  99.82
    36      9 226990   0.00  99.83
    37      9 226999   0.00  99.83
    38      9 227008   0.00  99.84
    39     12 227020   0.01  99.84
    40      6 227026   0.00  99.84
    41      9 227035   0.00  99.85
    42      7 227042   0.00  99.85
    43      2 227044   0.00  99.85
    44      5 227049   0.00  99.85
    45      4 227053   0.00  99.86
    46      8 227061   0.00  99.86
    47      5 227066   0.00  99.86
    48      8 227074   0.00  99.86
    49      5 227079   0.00  99.87
    50      4 227083   0.00  99.87
    51      5 227088   0.00  99.87
    52      9 227097   0.00  99.87
    53      6 227103   0.00  99.88
    54      5 227108   0.00  99.88
    55      8 227116   0.00  99.88
    56      4 227120   0.00  99.88
    58      1 227121   0.00  99.89
    59      5 227126   0.00  99.89
    61      4 227130   0.00  99.89
    62      2 227132   0.00  99.89
    63      5 227137   0.00  99.89
    64      3 227140   0.00  99.89
    65      2 227142   0.00  99.89
    66      1 227143   0.00  99.89
    67      1 227144   0.00  99.90
    68      3 227147   0.00  99.90
    69      2 227149   0.00  99.90
    70      3 227152   0.00  99.90
    71      3 227155   0.00  99.90
    72      2 227157   0.00  99.90
    73      3 227160   0.00  99.90
    74      4 227164   0.00  99.90
    75      1 227165   0.00  99.90
    76      3 227168   0.00  99.91
    77      3 227171   0.00  99.91
    78      2 227173   0.00  99.91
    79      1 227174   0.00  99.91
    80      3 227177   0.00  99.91
    81      2 227179   0.00  99.91
    82      4 227183   0.00  99.91
    83      5 227188   0.00  99.91
    84      2 227190   0.00  99.92
    85      4 227194   0.00  99.92
    87      2 227196   0.00  99.92
    88      4 227200   0.00  99.92
    89      2 227202   0.00  99.92
    90      3 227205   0.00  99.92
    91      3 227208   0.00  99.92
    92      2 227210   0.00  99.92
    93      1 227211   0.00  99.92
    94      2 227213   0.00  99.93
    95      3 227216   0.00  99.93
    96      1 227217   0.00  99.93
    98      1 227218   0.00  99.93
   100      8 227226   0.00  99.93
   101      2 227228   0.00  99.93
   102      1 227229   0.00  99.93
   103      3 227232   0.00  99.93
   104      1 227233   0.00  99.93
   105      3 227236   0.00  99.94
   107      1 227237   0.00  99.94
   108      2 227239   0.00  99.94
   109      1 227240   0.00  99.94
   111      1 227241   0.00  99.94
   114      2 227243   0.00  99.94
   115      5 227248   0.00  99.94
   118      2 227250   0.00  99.94
   120      3 227253   0.00  99.94
   123      2 227255   0.00  99.94
   125      1 227256   0.00  99.94
   126      1 227257   0.00  99.95
   127      1 227258   0.00  99.95
   129      1 227259   0.00  99.95
   132      2 227261   0.00  99.95
   133      1 227262   0.00  99.95
   135      1 227263   0.00  99.95
   138      1 227264   0.00  99.95
   139      1 227265   0.00  99.95
   140      1 227266   0.00  99.95
   142      3 227269   0.00  99.95
   144      1 227270   0.00  99.95
   149      4 227274   0.00  99.95
   153      1 227275   0.00  99.95
   155      1 227276   0.00  99.95
   156      2 227278   0.00  99.95
   157      1 227279   0.00  99.95
   158      2 227281   0.00  99.96
   160      1 227282   0.00  99.96
   163      2 227284   0.00  99.96
   165      1 227285   0.00  99.96
   166      1 227286   0.00  99.96
   179      1 227287   0.00  99.96
   181      1 227288   0.00  99.96
   185      2 227290   0.00  99.96
   192      2 227292   0.00  99.96
   195      3 227295   0.00  99.96
   202      3 227298   0.00  99.96
   203      1 227299   0.00  99.96
   204      1 227300   0.00  99.96
   211      1 227301   0.00  99.96
   213      1 227302   0.00  99.96
   219      1 227303   0.00  99.97
   225      1 227304   0.00  99.97
   227      1 227305   0.00  99.97
   232      1 227306   0.00  99.97
   237      1 227307   0.00  99.97
   238      1 227308   0.00  99.97
   246      1 227309   0.00  99.97
   252      1 227310   0.00  99.97
   257      1 227311   0.00  99.97
   260      1 227312   0.00  99.97
   269      1 227313   0.00  99.97
   270      2 227315   0.00  99.97
   272      1 227316   0.00  99.97
   273      1 227317   0.00  99.97
   274      1 227318   0.00  99.97
   275      2 227320   0.00  99.97
   279      1 227321   0.00  99.97
   286      1 227322   0.00  99.97
   308      1 227323   0.00  99.97
   314      1 227324   0.00  99.97
   320      2 227326   0.00  99.98
   321      1 227327   0.00  99.98
   322      1 227328   0.00  99.98
   329      1 227329   0.00  99.98
   345      1 227330   0.00  99.98
   350      1 227331   0.00  99.98
   352      1 227332   0.00  99.98
   353      1 227333   0.00  99.98
   360      4 227337   0.00  99.98
   366      1 227338   0.00  99.98
   375      2 227340   0.00  99.98
   380      1 227341   0.00  99.98
   382      1 227342   0.00  99.98
   389      1 227343   0.00  99.98
   398      1 227344   0.00  99.98
   401      7 227351   0.00  99.99
   409      1 227352   0.00  99.99
   424      1 227353   0.00  99.99
   446      9 227362   0.00  99.99
   450      2 227364   0.00  99.99
   456      1 227365   0.00  99.99
   465      1 227366   0.00  99.99
   466      1 227367   0.00  99.99
   493      2 227369   0.00  99.99
   515      1 227370   0.00  99.99
   519      1 227371   0.00 100.00
   542      1 227372   0.00 100.00
   545      1 227373   0.00 100.00
   562      1 227374   0.00 100.00
   573      1 227375   0.00 100.00
   583      1 227376   0.00 100.00
   621      1 227377   0.00 100.00
   673      2 227379   0.00 100.00
   674      3 227382   0.00 100.00

spam counts
value  #times   cumm      %   cumm%
     0 104332 104332  45.88  45.88
     1 108911 213243  47.90  93.78
     2   7225 220468   3.18  96.96
     3   2368 222836   1.04  98.00
     4   1190 224026   0.52  98.52
     5    692 224718   0.30  98.83
     6    486 225204   0.21  99.04
     7    305 225509   0.13  99.18
     8    280 225789   0.12  99.30
     9    183 225972   0.08  99.38
    10    152 226124   0.07  99.45
    11    127 226251   0.06  99.50
    12     76 226327   0.03  99.54
    13     73 226400   0.03  99.57
    14     76 226476   0.03  99.60
    15     74 226550   0.03  99.63
    16     45 226595   0.02  99.65
    17     58 226653   0.03  99.68
    18     52 226705   0.02  99.70
    19     38 226743   0.02  99.72
    20     42 226785   0.02  99.74
    21     26 226811   0.01  99.75
    22     20 226831   0.01  99.76
    23     35 226866   0.02  99.77
    24     23 226889   0.01  99.78
    25     20 226909   0.01  99.79
    26     20 226929   0.01  99.80
    27     24 226953   0.01  99.81
    28     13 226966   0.01  99.82
    29     14 226980   0.01  99.82
    30      6 226986   0.00  99.83
    31     11 226997   0.00  99.83
    32     19 227016   0.01  99.84
    33      9 227025   0.00  99.84
    34      6 227031   0.00  99.85
    35     11 227042   0.00  99.85
    36      9 227051   0.00  99.85
    37     10 227061   0.00  99.86
    38      5 227066   0.00  99.86
    39      5 227071   0.00  99.86
    40      3 227074   0.00  99.86
    41      7 227081   0.00  99.87
    42     10 227091   0.00  99.87
    43      5 227096   0.00  99.87
    44      7 227103   0.00  99.88
    45      7 227110   0.00  99.88
    46      1 227111   0.00  99.88
    47      4 227115   0.00  99.88
    48      6 227121   0.00  99.89
    49      4 227125   0.00  99.89
    50      5 227130   0.00  99.89
    51      1 227131   0.00  99.89
    52      2 227133   0.00  99.89
    53      1 227134   0.00  99.89
    54      2 227136   0.00  99.89
    55      4 227140   0.00  99.89
    56      4 227144   0.00  99.90
    57      5 227149   0.00  99.90
    58      5 227154   0.00  99.90
    59      8 227162   0.00  99.90
    60      3 227165   0.00  99.90
    61      2 227167   0.00  99.91
    62      4 227171   0.00  99.91
    63      5 227176   0.00  99.91
    64      2 227178   0.00  99.91
    65      2 227180   0.00  99.91
    66      2 227182   0.00  99.91
    68      5 227187   0.00  99.91
    69      1 227188   0.00  99.91
    70      1 227189   0.00  99.92
    71      1 227190   0.00  99.92
    72      4 227194   0.00  99.92
    74      2 227196   0.00  99.92
    75      4 227200   0.00  99.92
    76      1 227201   0.00  99.92
    77      3 227204   0.00  99.92
    78      2 227206   0.00  99.92
    79      5 227211   0.00  99.92
    80      4 227215   0.00  99.93
    81      4 227219   0.00  99.93
    84      1 227220   0.00  99.93
    85      3 227223   0.00  99.93
    86      1 227224   0.00  99.93
    87      3 227227   0.00  99.93
    88      3 227230   0.00  99.93
    89      3 227233   0.00  99.93
    90      3 227236   0.00  99.94
    91      1 227237   0.00  99.94
    92      1 227238   0.00  99.94
    93      2 227240   0.00  99.94
    94      1 227241   0.00  99.94
    95      2 227243   0.00  99.94
    97      2 227245   0.00  99.94
    98      1 227246   0.00  99.94
   100      1 227247   0.00  99.94
   101      3 227250   0.00  99.94
   102      2 227252   0.00  99.94
   103      1 227253   0.00  99.94
   104      2 227255   0.00  99.94
   105      3 227258   0.00  99.95
   107      4 227262   0.00  99.95
   109      2 227264   0.00  99.95
   111      1 227265   0.00  99.95
   113      2 227267   0.00  99.95
   115      1 227268   0.00  99.95
   120      1 227269   0.00  99.95
   121      1 227270   0.00  99.95
   123      2 227272   0.00  99.95
   126      1 227273   0.00  99.95
   128      1 227274   0.00  99.95
   130      1 227275   0.00  99.95
   131      1 227276   0.00  99.95
   132      1 227277   0.00  99.95
   138      3 227280   0.00  99.96
   140      2 227282   0.00  99.96
   141      2 227284   0.00  99.96
   145      1 227285   0.00  99.96
   147      1 227286   0.00  99.96
   148      3 227289   0.00  99.96
   151      1 227290   0.00  99.96
   152      4 227294   0.00  99.96
   154      1 227295   0.00  99.96
   157      1 227296   0.00  99.96
   158      1 227297   0.00  99.96
   159      1 227298   0.00  99.96
   163      2 227300   0.00  99.96
   164      1 227301   0.00  99.96
   167      7 227308   0.00  99.97
   176     11 227319   0.00  99.97
   178      1 227320   0.00  99.97
   181      1 227321   0.00  99.97
   183      1 227322   0.00  99.97
   185      1 227323   0.00  99.97
   187      1 227324   0.00  99.97
   191      2 227326   0.00  99.98
   194      2 227328   0.00  99.98
   198      1 227329   0.00  99.98
   205      1 227330   0.00  99.98
   216      1 227331   0.00  99.98
   218      1 227332   0.00  99.98
   234      1 227333   0.00  99.98
   236      1 227334   0.00  99.98
   246      1 227335   0.00  99.98
   259     11 227346   0.00  99.98
   261      1 227347   0.00  99.98
   263      1 227348   0.00  99.99
   268      1 227349   0.00  99.99
   288      2 227351   0.00  99.99
   291      1 227352   0.00  99.99
   296      1 227353   0.00  99.99
   298      2 227355   0.00  99.99
   301      1 227356   0.00  99.99
   308      1 227357   0.00  99.99
   322      1 227358   0.00  99.99
   328      1 227359   0.00  99.99
   331      1 227360   0.00  99.99
   333      1 227361   0.00  99.99
   355      1 227362   0.00  99.99
   364      1 227363   0.00  99.99
   381      1 227364   0.00  99.99
   394      2 227366   0.00  99.99
   398      2 227368   0.00  99.99
   418      1 227369   0.00  99.99
   420      1 227370   0.00  99.99
   434      1 227371   0.00 100.00
   474      1 227372   0.00 100.00
   488      1 227373   0.00 100.00
   517      1 227374   0.00 100.00
   562      1 227375   0.00 100.00
   566      1 227376   0.00 100.00
   571      1 227377   0.00 100.00
   606      1 227378   0.00 100.00
   615      1 227379   0.00 100.00
   618      1 227380   0.00 100.00
   620      1 227381   0.00 100.00
   621      1 227382   0.00 100.00

The database is a Berkeley DB file, 20,062,208 bytes on disk.  Ironically
<wink>, the plain-text db export file is a third of that size.
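
Reading the cumm columns above against Brad's 255 cutoff: ham counts reach
227310 tokens at value 252 and spam counts reach 227335 at value 246, so 72
tokens have a ham count above 255 and 47 have a spam count above 255, out of
227382 (roughly 0.03% and 0.02%).  A minimal sketch of the same check on raw
counts, in Python 3; tokens_above and the sample list are made up for
illustration and are not part of SpamBayes:

```python
def tokens_above(counts, threshold=255):
    """Return (how many counts exceed threshold, total, percentage)."""
    total = len(counts)
    above = sum(1 for c in counts if c > threshold)
    # Guard against an empty list so the percentage is always defined.
    pct = above * 100.0 / total if total else 0.0
    return above, total, pct

# Hypothetical per-token counts, not taken from any real database.
sample_counts = [0, 1, 1, 2, 255, 256, 300, 674]
print(tokens_above(sample_counts))
```

A count of exactly 255 is not counted as above the cutoff, which matches the
"greater than 255" wording of the question.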

[Tony Meyer]
> I'm happy to give you data for my setup, but I'm lazy <wink>.  Would
> you mind sending me/the list a copy of the Excel formulae for these?

Here's what I used:

f = file('/code/spambayes/db.txt')  # change to your export file

# The first line of the export holds the nham and nspam totals.
nham, nspam = map(int, f.readline().split(',')[:-1])
print 'nham', nham, 'nspam', nspam

# Every other line is backtick-separated; fields 1 and 2 are the
# token's ham and spam counts.
hamcounts = []
spamcounts = []
for line in f:
    h, s = map(int, line.split('`')[1:3])
    hamcounts.append(h)
    spamcounts.append(s)

def hist(tag, data):
    # Map each distinct count value to the number of tokens having it.
    freq = {}
    for x in data:
        freq[x] = freq.get(x, 0) + 1

    totalcount = sum(freq.itervalues())
    sofar = 0  # running total for the cumulative columns
    counts = freq.items()
    counts.sort()
    print tag, "counts"
    print "value  #times   cumm      %   cumm%"
    for value, count in counts:
        sofar += count
        print "%6d %6d %6d %6.2f %6.2f" % (
            value,
            count,
            sofar,
            count * 1e2 / totalcount,
            sofar * 1e2 / totalcount)

hist("ham", hamcounts)
hist("spam", spamcounts)
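
The script above is Python 2 (print statements, file(), itervalues()).  For
anyone running Python 3, a port might look like the sketch below.  It assumes
the same export layout the script above parses; parse_export, hist_rows and
print_hist are names invented here, and the sample lines are hypothetical
stand-ins for a real export file.

```python
from collections import Counter

def parse_export(lines):
    """Parse export lines in the layout the original script expects."""
    lines = iter(lines)
    # Header line: nham and nspam, comma-separated; the last field is dropped.
    nham, nspam = map(int, next(lines).split(',')[:-1])
    hamcounts, spamcounts = [], []
    for line in lines:
        # Backtick-separated record; fields 1 and 2 are the two counts.
        h, s = map(int, line.split('`')[1:3])
        hamcounts.append(h)
        spamcounts.append(s)
    return nham, nspam, hamcounts, spamcounts

def hist_rows(data):
    """Return (value, times, cumulative, pct, cumulative pct) rows."""
    freq = Counter(data)
    total = sum(freq.values())
    rows, sofar = [], 0
    for value, times in sorted(freq.items()):
        sofar += times
        rows.append((value, times, sofar,
                     times * 100.0 / total, sofar * 100.0 / total))
    return rows

def print_hist(tag, data):
    print(tag, "counts")
    print("value  #times   cumm      %   cumm%")
    for row in hist_rows(data):
        print("%6d %6d %6d %6.2f %6.2f" % row)

# Hypothetical export lines; pass an open file object instead to read a
# real export.
sample = ["3,2,\n", "a`0`1`\n", "b`1`0`\n", "c`1`2`\n", "d`3`0`\n"]
nham, nspam, hamcounts, spamcounts = parse_export(sample)
print('nham', nham, 'nspam', nspam)
print_hist("ham", hamcounts)
print_hist("spam", spamcounts)
```

Splitting the table computation (hist_rows) from the printing keeps the
numbers easy to check or reuse without parsing the printed output.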



