Help beautify ugly heuristic code

Carlos Ribeiro carribeiro at gmail.com
Wed Dec 8 19:13:57 EST 2004


On 8 Dec 2004 15:39:15 -0800, Lonnie Princehouse
<finite.automaton at gmail.com> wrote:
> Regular expressions.
> 
> It takes a while to craft the expressions, but this will be more
> elegant, more extensible, and considerably faster to compute (matching
> compiled re's is fast).

I think this problem is a little harder than that. As the OP noted,
each ISP uses a different notation, so a fixed set of expressions will
always miss some cases. A better solution may be a statistical
approach, possibly a custom Bayesian filter that could "learn" some of
the patterns.

The basic idea is as follows:

-- break the URL into pieces, splitting not only on the dots, but also
on hyphens and underscores in the name.

-- classify each part, using REs to identify common patterns: frequent
strings (com, gov, net, org); normal words (sequences of letters);
normal numbers; combinations of numbers & letters; common substrings
can also be identified (such as isp, in the middle of one of the
strings).

-- check these pieces against the Bayesian filter, pretty much as it's
done for spam.
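
The three steps above could be sketched roughly as follows. This is
only a minimal illustration, not a tested solution: the token classes,
the "sub:isp" feature, the "server"/"client" labels and the training
host names are all made up for the example, and the classifier is a
plain naive Bayes with add-one smoothing rather than any particular
spam filter's algorithm.

```python
import math
import re
from collections import Counter

# Hypothetical token classes; the patterns are guesses, tried in order.
TOKEN_CLASSES = [
    ("tld", re.compile(r"^(com|gov|net|org)$")),
    ("number", re.compile(r"^\d+$")),
    ("word", re.compile(r"^[a-z]+$")),
    ("mixed", re.compile(r"^[a-z0-9]+$")),
]

def tokenize(name):
    """Split a host name on dots, hyphens and underscores."""
    return [p for p in re.split(r"[._-]", name.lower()) if p]

def classify(part):
    """Return the name of the first token class that matches."""
    for kind, pattern in TOKEN_CLASSES:
        if pattern.match(part):
            return kind
    return "other"

def features(name):
    """Turn a host name into a list of features for the filter."""
    feats = []
    for part in tokenize(name):
        kind = classify(part)
        # Keep the literal text of words and frequent strings; collapse
        # numbers and letter/digit mixes into their class name.
        feats.append(part if kind in ("tld", "word") else kind)
        if "isp" in part:
            feats.append("sub:isp")  # common substring, as an extra hint
    return feats

class NaiveBayes:
    """Tiny naive Bayes classifier with add-one smoothing."""

    def __init__(self, labels):
        self.counts = {label: Counter() for label in labels}
        self.docs = Counter()

    def train(self, name, label):
        self.docs[label] += 1
        self.counts[label].update(features(name))

    def score(self, name, label):
        vocab = set()
        for counter in self.counts.values():
            vocab.update(counter)
        total = sum(self.counts[label].values())
        n_docs = sum(self.docs.values())
        # Smoothed log prior plus smoothed log likelihood per feature.
        logp = math.log((self.docs[label] + 1) / (n_docs + len(self.counts)))
        for feat in features(name):
            logp += math.log(
                (self.counts[label][feat] + 1) / (total + len(vocab) + 1)
            )
        return logp

    def predict(self, name):
        return max(self.counts, key=lambda label: self.score(name, label))

# Invented training data, only to show how the pieces fit together.
clf = NaiveBayes(["server", "client"])
for host in ("mail.example.com", "smtp.example.org", "mx1.example.net"):
    clf.train(host, "server")
for host in ("adsl-12-34-56.someisp.net", "dsl-200-10.pool.example.com",
             "host-10-0-0-1.dynamic.example.net"):
    clf.train(host, "client")

print(clf.predict("cable-88-12.someisp.com"))
print(clf.predict("mail.example.org"))
```

Collapsing every number into a single "number" feature is a design
choice: the filter then learns that "names with lots of numeric parts"
look like dial-up or DSL clients, instead of memorizing individual
addresses.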

I think that this approach is promising. It relies on the fact that
real servers usually do not have numbers in their names. Exact
identification, whether by a literal match or by a regular expression,
is very difficult, which is why a statistical method looks like a
better fit. I'm willing to try it, but first, more data is needed.

-- 
Carlos Ribeiro
Consultoria em Projetos
blog: http://rascunhosrotos.blogspot.com
blog: http://pythonnotes.blogspot.com
mail: carribeiro at gmail.com
mail: carribeiro at yahoo.com


