[Spambayes] Matt Sergeant: Introduction

Anthony Baxter anthony@interlink.com.au
Tue, 01 Oct 2002 19:29:48 +1000


>>> Matt Sergeant wrote
> And to give back I'll tell you that one of my biggest wins was parsing 
> HTML (with HTML::Parser - a C implementation so it's very fast) and 
> tokenising all attributes, so I get:
> 
>    colspan=2
>    face=Arial, Helvetica, sans-serif
> 
> as tokens. Plus using a proper HTML parser I get to parse HTML comments 
> too (which is a win).

With the Graham code, we found that the simple-minded parsing of HTML
actually hurt more than it helped, but it was a _very_ simple
split-on-whitespace. In a case of synchronicity, I'm currently running a
test over my newer, larger monster corpus (35Kh/17Ks) to extract the
avpairs from HTML tokens.
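
For anyone wanting to try the attribute idea, here is a rough sketch of
what it might look like in Python using the standard library's
html.parser. It is not the test code mentioned above and not what
HTML::Parser does internally; the names AttrTokenizer and attr_tokens
and the "comment:" prefix are made up for illustration.

```python
from html.parser import HTMLParser


class AttrTokenizer(HTMLParser):
    """Collect name=value tokens from tag attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # bare attributes such as <td nowrap>.
        for name, value in attrs:
            if value is None:
                self.tokens.append(name)
            else:
                self.tokens.append(f"{name}={value}")

    def handle_comment(self, data):
        # Assumption: comment words get a prefix so they stay distinct
        # from body-text tokens.
        self.tokens.extend("comment:" + word for word in data.split())


def attr_tokens(html):
    """Return attribute and comment tokens found in an HTML string."""
    parser = AttrTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Feeding it a fragment such as '<td colspan=2>' should yield the token
colspan=2, with each whole attribute value kept as a single token
rather than being split on whitespace.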

Anthony