[spambayes-dev] Re: Chung-Kwei algorithm (from BBC news)

Thu Aug 26 15:46:45 CEST 2004

Firmicus at ankabut.net quoted a press release:

> 'DNA analysis' spots e-mail spam
>
> By Jo Twist
>
> BBC News Online science and technology staff
>
>
> Few would have thought that when Crick and Watson discovered DNA,
> it would help in making a tool to fight spam.
>
> But computational biologists at IBM's TJ Watson Research Center
> have devised an anti-spam filter based on the way scientists analyse
> genetic sequences.
>
> Called after Feng Shui character Chung-Kwei, the formula automatically
> learns patterns of spam vocabulary and has proved to be 96.5% efficient.
>
> In tests, the filter only misidentified one message in 6,000 as spam.
>
> Pillar of protection
>
> Isidore Rigoutsos and Tien Huynh, at IBM's bioinformatics and pattern
> discovery research group, started to develop the formula - or algorithm
> - a little over a year ago.
>
> They named the formula, Chung-Kwei, after a Feng Shui character who is
> usually shown carrying a bat, and also holds a sword behind him.
>
> He is an important figure for those involved in business and who have
> expensive goods that need protection.
>
> Chung-Kwei grew out of another algorithm called Teiresias which the
> researchers were using for pattern discovery in computation biology
> sequencing, specifically, in protein annotation.
>
> 	"To train 88,000 messages takes about 15 minutes on a normal single processor. 
>         If, an hour, later we have more spam we can add to the collection so we keep 
>         on learning more and more"
>         -- Isidore Rigoutsos, IBM
>
> The algorithm helped in automatically determining the properties of
> a protein, like function and structure, directly from a string.
>
> "Obviously algorithms that pertain to pattern discovery are applicable
> to a vast range of problems," Mr Rigoutsos explained to BBC News Online.
>
> Instead of looking at strings of protein, Chung-Kwei uses Teiresias to
> identify strings of character sequences which appear in spam, but never
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ exactly my algorithm
> in non-spam mail.
>
> Their work, said Mr Rigoutsos, was helped by the large volume of spam
> which they received at their own workplace.
>
> "We have lots of e-mails that we know are bona fide spam. If we run a
> pattern analysis on those, it can see letters that appear frequently.
>
> "One of the properties of the algorithm is that it will spot two or
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> more occurrences. It doesn't matter where it is in the message.
  ^^^^^^^^^^^^^^^^ Exactly what mine does
>
> "If you do this, effectively you get small collections of letters so
> you can think of these as a vocabulary of sorts. If you have lots of
> data to work with, your vocabulary will be able to describe the data
> in a different form."
>
> Spam training
>
> The algorithm can be trained so that it will not be fooled by cunning
> replacements of "S" with "$", a common ploy used by spammers to bypass
> conventional e-mail filters.
>
> The Chung-Kwei method builds up its database of known true-spam
> patterns and constantly adds new patterns it spots.
>
> It compares its vocabulary to e-mails which it knows do not contain
> spam. So, an incoming message hit with this pattern analysis will be
> rejected if it contains a large proportion of the same vocabulary
> patterns.
>
> If a message received had a lot of spam patterns in it, it was scored
> highly. Chung-Kwei succeeded in spotting almost 97% of junk mails.
>
> "We experimented with large collections of e-mail. We have 66,000
> training messages that are all spam and 22,000 training messages that
> are all 'white' [non-spam].
>
> "To train 88,000 messages takes about 15 minutes on a normal single
> processor. If, an hour, later we have more spam, we can add to the
> collection so we keep on learning more and more."
>
> Various anti-spam software use several techniques to spot and kill
> junk mail, but IBM believes the Chung-Kwei algorithm to be the only
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> anti-spam tool that uses pattern discovery in this way.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mine does)
>
> Some tools look at the route an e-mail has taken and its origins;
> others involve identity verification and black and white listing
> of accepted and not accepted addresses.
>
> Others use Bayesian combinations of individual words that statistically
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ie this doesn't)
> make up spam messages.
>
> The system has to go through some more pilot studies and testing
> before it is let loose to protect inboxes.

I published what I am fairly sure is the same algorithm, or at least
close enough to invalidate a patent on the grounds of prior art, back
in February of this year.  I also mailed it to a few prominent researchers
like Bill Yerazunis, just to get it on the record.  (One of those
researchers was a spam researcher at IBM incidentally, though I don't
believe he was involved in the project quoted above)

Here's the links, decide for yourself.  I'm not interested in taking
credit for this, if it is the same algorithm, just in making sure it
is not patented.

------------------

] Date: Fri, 18 Jun 2004 12:33:02 -0500
] From: Graham Toal <gtoal at gtoal.com>
] To: mertz at gnosis.cx
] Subject: spam algorithms
] Message-ID: <40D3274E.mailN3311EYMT at gtoal.com>
] User-Agent: nail 10.5 4/27/03
] MIME-Version: 1.0
] Content-Type: text/plain; charset=us-ascii
] Content-Transfer-Encoding: 7bit
]
] I was searching the IBM search engine for spam algorithms and
] found your page:
] 
] > . Bayesian trigram filters
] > I decided to look into how well a much more starkly limited
] > model space would work for a Bayesian spam filter.
] > Specifically, I decided to use trigrams for my probability
] > model rather than "words".
]
] Have a look at some notes I wrote up earlier this year:
]    http://www.gtoal.com/mt/archives/2004_02.html
]
] You may find this approach interesting.  It's sort of like your idea but
] taken to the extreme.  Subsequent to writing the description in the above
] link, I have actually hacked up some prototype code and the idea does show
] promise.  I've not had the resources to make a fully usable spam filter
] out of it, but it's a convincing proof of concept.
]  ( http://www.gtoal.com/spam/devel-temp/tokra3.c.html )
]
] I hope you don't mind me mailing this to you unsolicited, but I want
] a few people working in the field to be aware of it as I think it is
] original and by making it public I can forstall someone reinventing
] it later and claiming a bogus patent on it!
]
]
] Regards
]
] Graham Toal
---------------------------

I am BCC'ing this post to Isidore Rigoutsos and Tien Huynh for their
opinion as to whether their "pattern discovery" algorithm and my
"token recognition" algorithm are the same algorithm (and to ask as
to whether IBM is attempting to patent it...)

G