[Spambayes] Strip Subject of Non-alpha

Tue Dec 9 11:03:18 EST 2003

    >> I never got overwhelming encouragement for my ideas about how to add
    >> experimental extensions to the CVS repository.

    Tim> Probably because it came attached to such a weak change <wink>.

Okay, ignore the bit about a specific "enhancement".  We all know most of
them don't work anyway.  Still, suppose someone comes up with an idea (we
get them all the time in the spambayes mailing list): "I know, how about
using the new header transmogrification feature of RFC-4822?", but doesn't
have the programming cojones to implement it.  Someone else comes along,
realizes it wouldn't be such a big deal to implement, does so and posts,
"Okay, try the version in CVS.  SpamBayes now has a 'Tokenizer:X-transmogrify'
option.  Let us know whether it helps or not."

People can then experiment with RFC-4822 transmogrification.  If it proves
not to be a worthy addition, the code can be ripped out.  The key is
tweaking the options parser to not care if there is no
"Tokenizer:X-transmogrify" option (because the code was ripped out later) or
to map "Tokenizer:X-transmogrify" to "Tokenizer:transmogrify" if it gains
acceptance and moves out of the trial stage.  (In fact, perhaps it should
work the other way as well, so we can rip stuff out that's not useful
without breaking people's options files.  See below.)

I just checked in a change to spambayes/OptionsClass.py which implements an
experimental/deprecated option feature.  It works like this:

    * Option is "foo", user sets "foo".  status quo.
    * Option is "X-foo", user sets "X-foo".  status quo.
    * Option is "foo", user sets "X-foo".  "foo" is set silently.
    * Option is "X-foo", user sets "foo".  "X-foo" is set and a warning
      emitted.

The third case covers experimental options.  The fourth case covers
deprecated options.  (The description for deprecated options in Options.py
should start with "(DEPRECATED) ".)
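
For illustration only, the four cases above can be sketched as a small
name-resolution function.  This is not the code checked in to
OptionsClass.py; the function name resolve_option_name and the idea of
passing in a plain set of defined names are assumptions made for the example.

```python
import warnings


def resolve_option_name(defined, requested):
    """Map the name a user set onto the name the program defines.

    `defined` is the set of option names the program knows about.
    Hypothetical helper, not part of the SpamBayes API.
    """
    if requested in defined:
        # Cases 1 and 2: the names match exactly, status quo.
        return requested
    if requested.startswith("X-") and requested[2:] in defined:
        # Case 3: an experimental option was accepted and lost its
        # "X-" prefix; honor the old spelling silently.
        return requested[2:]
    if "X-" + requested in defined:
        # Case 4: the option was deprecated and gained an "X-" prefix;
        # honor the old spelling but tell the user.
        warnings.warn("option %r is deprecated" % requested)
        return "X-" + requested
    raise KeyError(requested)
```

With defined = {"foo", "X-bar"}, a user setting "X-foo" gets "foo" with no
complaint, while a user setting "bar" gets "X-bar" plus a warning.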

    Tim> Really, a few people tested it and it didn't seem to matter either
    Tim> way.

Granted.  One thing I wonder about is how "current" people's training
databases are.  New techniques like cömmênt àccéntüätîón or em.bed-ed
punc#tua_tion aren't likely to turn up much in older training databases.  I
canned my old training database recently and have been working on rebuilding
it from scratch.  I think it's important that our training databases evolve
as spam does.

Another change I have locally is the remove_punctuation tokenizer gimmick I
alluded to above.  It also doesn't seem to change fp/fn results at the level
of pushing messages clearly out of one category into another; however, it
seems to pretty consistently spread the ham/spam means apart a bit and
reduce their standard deviations.  I'm more interested in a framework for
making such experimental changes easier for non-programmers to try out.
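
As a rough idea of what a punctuation-removing step might look like (a
sketch only, not the local change itself; the function name
strip_embedded_punctuation and the choice to drop only ASCII punctuation are
assumptions):

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_embedded_punctuation(text):
    """Delete ASCII punctuation from each whitespace-separated word.

    Words that consist of nothing but punctuation are dropped.
    Accented letters are left alone.
    """
    words = (w.translate(_PUNCT_TABLE) for w in text.split())
    return " ".join(w for w in words if w)
```

Feeding it "em.bed-ed punc#tua_tion" yields "embeded punctuation", so the
obfuscated and plain spellings produce the same tokens.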

    Tim> Experimental extensions are fine by me, and you proposed a decent
    Tim> scheme for putting them in.  The downside is that every piece of
    Tim> code complicates the whole, and I really don't know why you'd
    Tim> *want* to check in a gimmick that made no real difference to anyone
    Tim> who tried it (if I remember all the reports correctly -- maybe
    Tim> not).

The point isn't sticking code in, it's being able to easily yank it back
out.  (I think my checkin should make that easier.)  You mentioned
generate_time_buckets and extract_dow.  I'll turn the screws in a moment to
deprecate them.  If this idea doesn't fly with people, or these options are
deemed crucial for enough people, we can just un-deprecate them.

(BTW, has anyone on a Unix-ish system tried out testtools/Makefile when
running timcv?  If so, does it help or am I the only person who finds it
useful?)

Skip