Help beautify ugly heuristic code

Stuart D. Gathman stuart at bmsi.com
Wed Dec 8 20:27:18 EST 2004


On Wed, 08 Dec 2004 19:52:53 -0500, Lonnie Princehouse wrote:

> I don't think a Bayesian classifier is going to be very helpful here,
> unless you have tens of thousands of examples to feed it, or unless it

We do have tens of thousands of examples to feed it.

> The series of if host.find(...) lines in is_dynip() is equivalent to a
> regular expression, but much more expensive to execute because of all

It is not equivalent, because the patterns are derived from the IP
address being tested.  As I mentioned before, I tried building a custom
regex from the IP for each test, but compiling a new regex for every
test is far too slow.
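
For illustration only, a per-IP regex of the kind described above might
look like the sketch below.  The helper name ip_regex, the separator
class, and the two orderings are assumptions made for this example;
they are not the patterns used in is_dynip().  Because every IP yields
a different pattern string, each call has to compile a new pattern.

```python
import re

def ip_regex(ip):
    """Build a regex matching a few decimal renderings of `ip`.

    Hypothetical helper: the separators and orderings below are
    illustrative assumptions, not the real is_dynip() patterns.
    """
    a, b, c, d = ip.split('.')
    sep = r'[-._]?'                      # optional separator between octets
    forward = sep.join(re.escape(p) for p in (a, b, c, d))
    reverse = sep.join(re.escape(p) for p in (d, c, b, a))
    # A distinct pattern per IP, so it must be compiled for each test.
    return re.compile('(?:%s|%s)' % (forward, reverse))

pat = ip_regex('192.0.2.45')
print(bool(pat.search('45-2-0-192.dsl.example.net')))
print(bool(pat.search('mail.example.com')))
```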

> For IP addresses, you really just need a mechanism to filter blocks of
> IP addresses.  It might be easiest to first convert them into hex and
> then make liberal use of [0-f] in regular expressions.

The point of the IP address is *not* to recognize IP addresses.  The
point is to look for transformations of the IP address in the hostname.
This gives a *huge* bang for the buck.  I have been working on this
problem for a while.  If the hostname contains a transformation of the
IP address, it is (almost certainly) a dynamic address.  The ISPs are
very creative in their transformations, using the parts of the IP in
various orders and encoding them in hex, base64, decimal with or
without zero fill, and even Roman numerals.
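
A minimal sketch of the idea, using plain substring tests instead of a
regex.  The names ip_variants and looks_dynamic are hypothetical, and
the handful of encodings, orderings, and separators covered here are
assumptions chosen for the example; the real heuristic handles many
more cases (base64, Roman numerals, partial addresses).

```python
def ip_variants(ip):
    """Yield a few string renderings of a dotted-quad IPv4 address.

    Hypothetical helper: covers only decimal, zero-filled decimal and
    hex, in forward and reversed octet order.
    """
    octets = [int(p) for p in ip.split('.')]
    encoders = (
        lambda n: '%d' % n,      # plain decimal
        lambda n: '%03d' % n,    # decimal, zero-filled to 3 digits
        lambda n: '%02x' % n,    # lowercase hex, 2 digits
    )
    for enc in encoders:
        parts = [enc(n) for n in octets]
        for order in (parts, parts[::-1]):
            for sep in ('', '-', '.'):
                yield sep.join(order)

def looks_dynamic(ip, host):
    """Return True if any rendering of `ip` appears in `host`."""
    host = host.lower()
    return any(v in host for v in ip_variants(ip))

print(looks_dynamic('192.0.2.45', 'host-192-0-2-45.example.net'))
print(looks_dynamic('192.0.2.45', 'c000022d.dsl.example.net'))
print(looks_dynamic('192.0.2.45', 'mail.example.com'))
```

The substrings are rebuilt from the IP on every call, which costs only
a few string formats and joins and needs no pattern compilation.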

The regex engine is just not powerful enough to handle parameterized
regexes (that I know of).


