[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Tim Peters tim.one@comcast.net
Wed, 21 Aug 2002 23:39:34 -0400


[Tim]
>> Do you expect that to be an issue?  When I built a database from 20,000
>> messages, the whole thing fit in a Python dict consuming about 10MB.

[Eric S. Raymond]
> Hm, that's a bit smaller than I would have thought, but the order of
> magnitude I was expecting.

It's even smaller than that <wink>.  The dict here maps strings to
instances of a Python class (WordInfo).  The latter uses new-in-2.2
__slots__, and those give major memory efficiencies over old-style
classes, but there's still substantial memory overhead compared to what's
possible in C.  In addition, there are memory overheads for the Python
objects stored in the WordInfo instances, including a Python float object
in each record recording the time.time() of last access by the scoring
method.

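For the curious, the record is roughly shaped like the sketch below.
Apart from atime (see the code comment further down), the field names
here are illustrative rather than a promise about what GBayes.py
actually stores:

    import time

    class WordInfo(object):
        # Rough sketch of the per-word record described above; the real
        # GBayes.py may use a different field set.  __slots__ suppresses
        # the per-instance __dict__, which is where most of the memory
        # saving over old-style classes comes from.
        __slots__ = ('spamcount', 'hamcount', 'killcount', 'atime')

        def __init__(self, atime):
            self.spamcount = 0    # times seen in spam during training
            self.hamcount = 0     # times seen in ham during training
            self.killcount = 0    # times it made the "best 15" list
            self.atime = atime    # time.time() of last use by scoring

    wordinfo = {}                              # token -> WordInfo
    wordinfo['viagra'] = WordInfo(time.time())
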
IOW, there are tons of memory overheads here, yet the size was still
minor.  So I have no hesitation leaving this part in Python, and coding
this part up was a trivial finger exercise.  You know all that, though!
It makes your decision to use C from the start hard to fathom.

> ...
> Recognition features should age!  Wow!  That's a good point!  With the
> age counter being reset when they're recognized.

For concreteness, here's the comment from the Python code, which I
believe is accurate:

    # (*)atime is the last access time, a UTC time.time() value.  It's the
    # most recent time this word was used by scoring (i.e., by spamprob(),
    # not by training via learn()); or, if the word has never been used by
    # scoring, the time the word record was created (i.e., by learn()).
    # One good criterion for identifying junk (word records that have no
    # value) is to delete words that haven't been used for a long time.
    # Perhaps they were typos, or unique identifiers, or relevant to a
    # once-hot topic or scam that's fallen out of favor.  Whatever, if
    # a word is no longer being used, it's just wasting space.

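A pruning pass built on that criterion can be tiny.  Here's an untested
sketch; the 30-day cutoff is pulled out of the air, and wordinfo is the
token -> WordInfo dict from before:

    import time

    def expire_old_words(wordinfo, max_age_days=30):
        # Drop records whose last scoring access is older than the
        # cutoff; per the comment above, they're presumed to be typos,
        # one-shot identifiers, or relics of dead topics.
        cutoff = time.time() - max_age_days * 24 * 3600
        for word in list(wordinfo.keys()):    # copy, so deleting is safe
            if wordinfo[word].atime < cutoff:
                del wordinfo[word]
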
Besides the space-saving gimmick, there may be practical value in
expiring older words that are getting used, but less frequently over
time.  That would be evidence that the nature of the world is changing,
and more aggressively expiring the model for how the world *used* to be
may speed adaptation to the new realities.  I'm not saving enough info to
do that, though, and it's unclear whether it would really help.  Against
it, while I see new spam gimmicks pop up regularly, the old ones never
seem to go away (e.g., I don't do anything to try to block spam on my
email accounts, and the bulk of the spam I get is still easily recognized
from the subject line alone).  However, because it's all written in
Python <wink>, it will be very easy to set up experiments to answer such
questions.

BTW, the ifile FAQ gives a little info about the expiration scheme ifile
uses.  Rennie's paper gives more:

    Age is defined as the number of e-mail messages which have been
    added to the model since frequency statistics have been kept for
    the word.  Old, infrequent words are to be dropped while young words
    and old, frequent words should be kept.  One way to quantify this
    is to say that words which occur fewer than log2(age)-1 times
    should be discarded from the model.  For example, if "baseball"
    occurred in the 1st document and occurred 5 or fewer times in the
    next 63 documents, the word and its corresponding statistics would
    be eliminated from the model's database.  This feature selection
    cutoff is used in ifile and is found to significantly improve
    efficiency without noticeably affecting classification performance.

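Stated as code, the cutoff is just this (a sketch of a literal reading
of the rule; count and age are whatever per-word statistics one chooses
to track):

    import math

    def keep_word(count, age):
        # Discard words that occurred fewer than log2(age)-1 times,
        # where age is the number of messages added to the model since
        # the word's statistics were first kept.  E.g. a word first seen
        # 64 messages ago needs at least log2(64)-1 == 5 occurrences to
        # survive.
        return count >= math.log(age) / math.log(2) - 1
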
I'm not sure how that policy would work with Graham's scheme (which has
many differences from the more conventional scheme ifile uses).  Our
Python code also saves a count of the number of times each word makes it
into Graham's "best 15" list, and I expect that to be a direct measure of
the value we're getting out of keeping a word ("word" means whatever the
tokenizer passes to the classifier -- it's really any hashable and (for
now) pickleable Python object).

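To make "direct measure" concrete, the bookkeeping amounts to no more
than something like this, with killcount standing in for whatever the
counter is really called:

    def note_best_clues(wordinfo, clue_words):
        # clue_words: the (at most) 15 tokens whose spam probabilities
        # were most extreme for one scored message.  Bumping a per-word
        # counter here measures how much classification work each word
        # is actually doing; words stuck at zero are prime candidates
        # for expiration.
        for word in clue_words:
            wordinfo[word].killcount += 1
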
[on Judy]
> I thought part of the point of the method was that you get
> sorting for free because of the way elements are inserted.

Sure, if range-search or final sorted order is important, it's a great
benefit.  I was only wondering about why you'd expect spatial locality in
the input as naturally ordered.

> ...
> No, but think about how the pointer in a binary search moves.  It's
> spatially bursty.  Memory access frequencies for repeated binary
> searches will be a sum of bursty signals, analogous to the way
> network traffic volumes look in the time domain.  In fact the
> graph of memory address vs. number of accesses is gonna wind up
> looking an awful lot like 1/f noise, I think.  *Not* evenly
> distributed; something there for LRU to work with.

Judy may or may not be able to exploit something here; I don't know, and
I'd need to know a lot more about Judy's implementation to even start to
guess.  Plain binary search has horrid cache behavior, though.  Indeed,
most older research papers on B-Trees suggest that binary search doesn't
buy anything in even rather large B-Tree nodes, because the cache misses
swamp the reduced instruction count over what a simple linear search
does.  Worse, that linear search can be significantly faster if the HW is
smart enough to detect the regular address access pattern of linear
search and do some helpful prefetching for you.  More recent research on
B-Trees is on ways to get away from bad binary search behavior; two
current lines are using explicit prefetch instructions to minimize the
stalls, and using a more cache-friendly data structure inside a B-Tree
node.  Out-guessing modern VM implementations is damned hard.

With a Python dict you're likely to get a cache miss per lookup.  If
that's a disaster, time.clock() isn't revealing it <wink>.

> ...
> What I'm starting to test now is a refactoring of the program where it
> spawns a daemon version of itself first time it's called.  The daemon
> eats the wordlists and stays in core fielding requests from subsequent
> program runs.  Basically an answer to the "how you call bogofilter 1K
> times a day from procmail without bringing your disks to their knees"
> problem -- persistence on the cheap.
>
> Thing is that the solution to this problem is very generic.  Might
> turn into a Python framework.

Sounds good!  Just don't call it "compilerlike" <wink>.
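
For concreteness, the generic shape is about this small -- a hand-wavy
sketch, not Eric's code, with a made-up socket path and function names:
a long-lived process loads the wordlists once and answers scoring
requests over a Unix socket, while the per-message client just ships the
text over and reads back the verdict.

    import os, socket

    SOCKPATH = '/tmp/bayes-daemon.sock'    # illustrative location

    def serve(score):
        # Long-lived daemon: load the wordlists once (before calling
        # this), then field scoring requests forever.  score is any
        # callable mapping message text to a spam probability.
        if os.path.exists(SOCKPATH):
            os.unlink(SOCKPATH)
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.bind(SOCKPATH)
        s.listen(5)
        while 1:
            conn, addr = s.accept()
            chunks = []
            while 1:
                data = conn.recv(8192)
                if not data:
                    break
                chunks.append(data)
            conn.sendall(str(score(''.join(chunks))))
            conn.close()

    def ask(msgtext):
        # Short-lived client, cheap enough to run once per procmail
        # delivery: ship the message to the daemon, read the score back.
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(SOCKPATH)
        s.sendall(msgtext)
        s.shutdown(1)        # done sending; the daemon sees EOF
        result = s.recv(8192)
        s.close()
        return float(result)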