[Spambayes] URL parsing improvement ideas

Harri Pesonen harri.pesonen at wicom.com
Thu Aug 28 13:25:40 EDT 2003


Great, thanks! :-)

I wonder why 1) was a loss. Perhaps the tokenizer should add a special
token when it finds any % escapes, and then replace them. Care to try
this as well? :-)
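
A rough sketch of what I mean, in Python. The helper name
decode_url_escapes and the token "url:%escape" are made up for
illustration; they are not taken from tokenizer.py, and the real
tokenizer may handle URLs quite differently.

```python
import re
from urllib.parse import unquote

# Matches a single %XX escape. Pattern and token name are assumptions.
ESCAPE_RE = re.compile(r"%[0-9A-Fa-f]{2}")

def decode_url_escapes(url):
    """Return (decoded_url, extra_tokens) for a URL string."""
    extra_tokens = []
    if ESCAPE_RE.search(url):
        # Record that escapes were present before decoding hides them.
        extra_tokens.append("url:%escape")
    return unquote(url), extra_tokens
```

The point is that the presence of escapes may itself be a spam clue, so
it is kept as a token instead of being thrown away by the decoding.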

Please send me the source code; I am reading the Quick Python book at
the moment...

Harri

-----Original Message-----
From: Meyer, Tony [mailto:T.A.Meyer at massey.ac.nz] 
Sent: 28 August 2003 11:51
To: Harri Pesonen; spambayes at python.org
Subject: RE: [Spambayes] URL parsing improvement ideas


> 1) Replace % escapes.

This is the decodes column.  A definite loss.

> 2) Find server names for ip addresses.

Still running.  This is very slow.  I'll post the results when they
arrive, but it would have to be amazing for it to be worth waiting this
long ;)
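
For anyone who wants to try this, here is a minimal sketch of a reverse
lookup with a cache, so each address is resolved only once per run. The
function name server_name_for_ip and the explicit cache dictionary are
assumptions for illustration, not the code used in this test.

```python
import socket

def server_name_for_ip(ip, cache, resolver=socket.gethostbyaddr):
    """Return the lower-cased host name for ip, or None if lookup fails.

    cache is a plain dict supplied by the caller. resolver is injectable
    so the function can be exercised without network access.
    """
    if ip not in cache:
        try:
            # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
            cache[ip] = resolver(ip)[0].lower()
        except OSError:
            # socket.herror and socket.gaierror are OSError subclasses.
            cache[ip] = None
    return cache[ip]
```

Caching failures as well as successes matters here, since lookups that
time out are likely to be the slowest ones.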

> 3) Remove numbers from the end of domain names (experimental).
>    www.buythis123.com => url:buythis

This is the no_num_urls column.  No difference.

> Or add a special token for domains ending with a number.

This is the url_end_nums column.  No effective difference.
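
The two variants of idea 3 can be sketched roughly as follows. This is
only an illustration: the function domain_tokens, its flags, and the
token "url:ends-with-number" are hypothetical and do not come from the
patches that were tested.

```python
import re

TRAILING_DIGITS_RE = re.compile(r"\d+$")

def domain_tokens(hostname, strip_numbers=True, flag_numbers=False):
    """Generate url: tokens for each dot-separated label of hostname."""
    tokens = []
    for label in hostname.lower().split("."):
        stripped = TRAILING_DIGITS_RE.sub("", label)
        if flag_numbers and stripped != label:
            # Variant 2: note that the label ended with a number.
            tokens.append("url:ends-with-number")
        # Variant 1: drop the trailing digits, unless nothing would remain.
        piece = stripped if (strip_numbers and stripped) else label
        tokens.append("url:" + piece)
    return tokens
```

With the defaults, domain_tokens("www.buythis123.com") includes
"url:buythis" as in the example above. All-digit labels are left alone,
so dotted IP addresses pass through unchanged.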

---

filename:     standards  no_num_urls url_end_nums      decodes
ham:spam:    7900:15260   7900:15260   7900:15260   7900:15260
fp total:             1            1            1            2
fp %:              0.01         0.01         0.01         0.03
fn total:           225          225          224          222
fn %:              1.47         1.47         1.47         1.45
unsure t:           531          531          533          558
unsure %:          2.29         2.29         2.30         2.41
real cost:      $341.20      $341.20      $340.60      $353.60
best cost:      $540.60      $540.80      $540.40      $547.40
h mean:            0.50         0.50         0.51         0.60
h sdev:            4.25         4.25         4.27         4.72
s mean:           93.44        93.43        93.46        93.46
s sdev:           20.68        20.69        20.64        20.41
mean diff:        92.94        92.93        92.95        92.86
k:                 3.73         3.73         3.73         3.70

---
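
The percentage and "real cost" rows can be reproduced from the raw
counts. The sketch below assumes weights of $10 per false positive, $1
per false negative and $0.20 per unsure message. Those weights are
inferred from the figures above (they reproduce all four "real cost"
values); they are not stated in this message. The function name
summarize is made up for illustration.

```python
def summarize(n_ham, n_spam, fp, fn, unsure,
              fp_cost=10.0, fn_cost=1.0, unsure_cost=0.20):
    """Recompute the derived rows of the table from raw counts."""
    return {
        # False positives are measured against the ham count.
        "fp %": round(100.0 * fp / n_ham, 2),
        # False negatives are measured against the spam count.
        "fn %": round(100.0 * fn / n_spam, 2),
        # Unsures are measured against all messages.
        "unsure %": round(100.0 * unsure / (n_ham + n_spam), 2),
        "real cost": round(fp * fp_cost + fn * fn_cost
                           + unsure * unsure_cost, 2),
    }

standards = summarize(7900, 15260, fp=1, fn=225, unsure=531)
decodes = summarize(7900, 15260, fp=2, fn=222, unsure=558)
```

Under these assumptions the extra false positive and the 27 extra
unsures in the decodes run outweigh its 3 fewer false negatives, which
is why its cost comes out $12.40 higher than standards.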

These are only my results, of course, and not guaranteed to match anyone
else's.  If you (or anyone else) have a testing setup ready and would
like to run any of these, I can provide patches or alternative versions
of tokenizer.py; just let me know.

Other ideas?  (Testing is fun ;)

=Tony Meyer


