[Spambayes] Introducing myself

Tue Nov 12 01:32:18 2002

11/11/2002 7:27:03 PM, Tim Peters <tim.one@comcast.net> wrote:

>[Robert Woodhead]
>> ...
>> It seems to me that you're at the point where testing the effects of
>> data reduction techniques would be fruitful.
>
>Bootstrapping a classifier, connecting to a gazillion quirky email clients,
>and testing training strategies are all current high priorities.  Saving
>memory wouldn't buy me anything in the Outlook client I'm using, or in the
>high-volume python.org application.  But, as I said, other people are keener
>on that, and I expect that reducing the sheer number of tokens is a more
>effective approach (in part because it ties into effective training
>strategies over time -- without active pruning, the database will just keep
>growing (albeit at a slackening pace), whether a token takes one byte or 50).
>
>> Once I get up and running on the code (just paid the tithe to O'Reilly)
>> I'll test it out.
>
>It's all yours <wink>.
>
>> One thing that occurred to me: now that you have something that seems
>> to work pretty well, have you considered backtracking on particular
>> features to see how much they contribute; for example, going to a
>> trivial state machine parser to spit out tokens?
>
>In theory, all prior decisions should be revisited after every change.  I
>haven't done anything like that lately, though, in part because no previous
>"let's revisit this!" experiment ever paid off.
>
>Note that the bulk of the body tokenizer couldn't be simpler:
>
>1. Convert to lowercase.
>2. Split on whitespace.
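
As a rough illustration of the two steps just listed, here is a minimal
sketch in Python. The function name tokenize_body is hypothetical, and the
sketch covers only these two steps, not whatever else the real body
tokenizer does.

```python
def tokenize_body(text):
    """Lowercase the text, then split it on runs of whitespace."""
    # str.split() with no argument splits on any run of spaces, tabs
    # or newlines and never yields empty strings.
    return text.lower().split()


sample = "Register NOW for the\tPython Conference!"
print(tokenize_body(sample))
```

Punctuation stays attached to the neighbouring word, so "Conference!"
becomes the single token "conference!".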

This makes me wonder what happens if someone spams you with various devices 
like c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions of
w
h
i
t
e
s
p
a
c
e

- TimS
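
For what it's worth, here is a small sketch of what the two quoted steps
(lowercase, split on whitespace) would do to text like the above. It only
shows the resulting tokens; how a trained classifier would score them is not
something this sketch can say.

```python
obfuscated = "c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions"
stacked = "w\nh\ni\nt\ne\ns\np\na\nc\ne"

tokens = obfuscated.lower().split()
print(tokens)

# The intended words never appear as tokens; their fragments do.
print("converting" in tokens)

# Dots are not whitespace, so "c.o.m.b" survives as one odd token.
print("c.o.m.b" in tokens)

# Newlines count as whitespace, so the stacked word becomes single letters.
print(stacked.lower().split())
```

The spaced-out words turn into single-letter and fragment tokens such as
"wor" and "ds", while the dotted ones stay glued together.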

>
>Well, we *could* skip #1, but previous experiments found that it didn't give
>better error rates but did increase the database size.  It did change the
>*kinds* of errors, though, and in particular conference announcements had a
>hard time getting thru when case was preserved (they're trying to sell you a
>conference, and often SCREAM ABOUT IT).
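
A toy sketch of the database-size side of that trade-off: counting distinct
tokens with and without case folding. The three messages are made up for
illustration and the helper name distinct_tokens is hypothetical; this says
nothing about error rates.

```python
messages = [
    "FREE Conference registration",
    "free conference Registration",
    "Free CONFERENCE registration",
]


def distinct_tokens(msgs, fold_case):
    """Return the set of distinct whitespace-separated tokens."""
    seen = set()
    for msg in msgs:
        text = msg.lower() if fold_case else msg
        seen.update(text.split())
    return seen


print(len(distinct_tokens(messages, fold_case=True)))   # 3
print(len(distinct_tokens(messages, fold_case=False)))  # 8
```

With case preserved, "FREE", "Free" and "free" are three separate database
entries, each with its own counts.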
>
>> ...
>> Yeah, we old farts ("When I was a lad, the bytes only had 6 bits!")
>
>They had 6 or 9 when I was a lad, depending on how you set the control bit
>for the Univac 1108's 36-bit words.
>
>> have lots of tricks.  We don't so much write code as remember it and
>> retype it.
>
>You don't want to bet on who's older here <wink>.
>
>> ...
>> Not really; it doesn't really matter what the format of a token
>> coming out of the parser is, does it?
>
>The classifier is happy with any immutable and hashable Python object, i.e.
>anything that can be used as a Python dict key.  But people grafting various
>databases onto this have stronger requirements, and they're not always
>clear.  As I mentioned last time, most "lightweight" databases require
>string keys, so any switch away from strings would break those systems.
>It's pre-alpha code, but still I'm not keen to rock anyone's boat unless
>there's a clear win in return.
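
To make the dict-key point concrete, here is a minimal sketch. The token
values and the helper to_string_key are hypothetical; using repr() to get a
string key is just one assumed way to satisfy a string-keyed store, not
something the project does.

```python
from collections import Counter

# Any immutable, hashable object can be a dict key, so these all work.
counts = Counter()
for token in ["cheap", ("subject", "re:"), 42, "cheap"]:
    counts[token] += 1

# A mutable object such as a list is not hashable and is rejected.
try:
    counts[["not", "hashable"]] += 1
except TypeError as exc:
    print("rejected:", exc)


def to_string_key(token):
    """Map a token to a string key for a string-only database."""
    # Assumption: repr() is good enough here. Note that a str token
    # such as "42" would collide with the int token 42.
    return token if isinstance(token, str) else repr(token)


string_store = {to_string_key(t): n for t, n in counts.items()}
print(string_store)
```

The in-memory dict accepts the tuple and the integer directly, while the
string-keyed copy needs an extra encoding step for every non-string token.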
>
>> ...
>> True; then it becomes a game of finding generic messages that are
>> likely to evaluate as hammy enough to the average recognizer.  And
>> the meta-response is to send out multiple emails with differently
>> tuned slices of ham.
>
>They can try.  Spam doesn't need to be stopped, though, it merely has to be
>made more costly to send than it brings back.
>
>Last week Jeremy and Guido here both reported a *very* effective technique:
>spam was sent to them as replies to mailing-list postings (not this mailing
>list <wink>) they had made, including a full quote of the msg they had
>posted.  That was guaranteed to have lots of ham words for them, and the
>Subject line was the expected "Re:" followed by their own subject line.
>
>I doubt they're going to get a response rate high enough to be able to
>afford this scheme over time, at least not on tech mailing lists.  We'll
>see; if they can, it's going to be hard to beat.
>
>> I hereby, btw, coin the term "Dagwood" (or perhaps it should be
>> Wooddag?) to mean an email containing artfully sliced amounts of ham,
>> spam, and html condiments.  ;^)
>
>Cool!  Dagwood it is.
>
>> ...
>> Well, what you'd need is a hacked HTML renderer that output sets that
>> look like (token,size,color,background) and ignored words that were
>> too small or hard to read.
>
>Sure.  I expect the quickest path would be to feed the source thru a
>text-only browser, and stare at its output.  That seems mondo expensive,
>though,
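
A very rough sketch of the (token, size, color, background) idea, using
Python's standard html.parser module rather than a real renderer. Everything
here is an assumption for illustration: the class name VisibleTokenParser,
the default size 3 / black on white, the min_size threshold, and the fact
that only <font size=... color=...> tags are looked at (no CSS, no relative
sizes like "+1", no background attributes).

```python
from html.parser import HTMLParser


class VisibleTokenParser(HTMLParser):
    """Collect (token, size, color, background) for readable text only."""

    def __init__(self, min_size=2):
        super().__init__()
        self.min_size = min_size
        # Assumed defaults: font size 3, black text, white background.
        self.stack = [(3, "black", "white")]
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag != "font":
            return
        size, color, background = self.stack[-1]
        attr = dict(attrs)
        try:
            size = int(attr.get("size", size))
        except (TypeError, ValueError):
            pass  # keep the inherited size if the value is unusable
        color = (attr.get("color") or color).lower()
        self.stack.append((size, color, background))

    def handle_endtag(self, tag):
        if tag == "font" and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        size, color, background = self.stack[-1]
        # Skip text that is too small or the same color as the background.
        if size < self.min_size or color == background:
            return
        for word in data.lower().split():
            self.tokens.append((word, size, color, background))


parser = VisibleTokenParser()
parser.feed('Buy now <font size="1">hidden ham words</font>'
            '<font color="white">ghost</font>')
parser.close()
print(parser.tokens)
```

Only "buy" and "now" come out; the size-1 and white-on-white runs are
dropped. A real renderer would have to resolve far more than this, which is
where the expense comes from.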
>
>>> For goodness sake, this is email we're talking about -- anyone
>>> trusting a truly critical msg to email is dreaming to begin with.
>
>> Unfortunately, in the real world, this happens all too often.  Keep
>> in mind that the readers of this list are not the typical users of
>> the resulting software techniques.
>
>I do, but it's still not my problem <0.5 wink>.  All non-trivial systems
>have non-zero FP rates, and that's a fact of life.  You're keen on
>whitelists, but they wouldn't do a thing to stop any of the false positives
>I've seen, and so on; a multitude of schemes may reduce the overall error
>rates if they're combined intelligently, but they're not going to reach an
>error rate of 0.  Not even with human review (as has become obvious to
>everyone who's run a good system over their supposedly clean ham and spam
>collections).  At some point, learning that Santa Claus isn't actually a
>white man is a part of growing up <wink>.
>
>show-me-an-isp-that-guarantees-email-delivery-and-we'll-get-
>    rich-shorting-its-stock-ly y'rs  - tim
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com