Help beautify ugly heuristic code

Lonnie Princehouse finite.automaton at gmail.com
Wed Dec 8 19:52:53 EST 2004


I don't think a Bayesian classifier is going to be very helpful here,
unless you have tens of thousands of examples to feed it, or unless it
was specially coded to first break addresses into better tokens for
classification (such as alphanumeric strings and numbers).
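
As a rough illustration of what "better tokens" could mean, here is a
minimal sketch that splits a hostname into runs of letters and replaces
each number with a placeholder. The name tokenize_host and the "<num>"
placeholder are made up for this example; they do not come from any
particular classifier.

```python
import re

# Hypothetical helper: one token per run of letters or run of digits.
token_expr = re.compile(r"[a-z]+|\d+")

def tokenize_host(host):
    """Split a hostname into letter runs, replacing digit runs with '<num>'."""
    tokens = []
    for tok in token_expr.findall(host.lower()):
        tokens.append("<num>" if tok.isdigit() else tok)
    return tokens

print(tokenize_host("dsl-12-34.example.net"))
```

Tokens like these would let a classifier treat "dsl-12-34" and
"dsl-99-1" as the same evidence instead of as two unrelated strings.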

The series of if host.find(...) lines in is_dynip() is equivalent to a
regular expression, but much more expensive to execute because of all
the list slicing, and it won't benefit from the re module's speedy
native implementation of regular expressions.

Try building a host_expr (as per my previous post) in the following
way:

# Suppose dynamic_host_list is a list of all the host strings already
# known to be dynamic.

. import re
.
. host_patterns = {}  # use a dict to guarantee uniqueness; a set would also work
. number_expr = re.compile(r"\d+")
. for dynamic_host in dynamic_host_list:
.     # escape the literal text (so '.' means a dot), then let \d+ stand
.     # in for each run of digits
.     parts = [re.escape(part) for part in number_expr.split(dynamic_host)]
.     pattern = '^' + r'\d+'.join(parts) + '$'
.     host_patterns[pattern] = True
.
. host_expr = re.compile('|'.join(host_patterns.keys()))

This will catch any hostname that differs only in its numbers from a
host you've already classified.
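
To show how the compiled expression might be used, here is a small
self-contained sketch. The sample hostnames and the helper names
host_to_pattern and looks_dynamic are invented for illustration, and the
sketch escapes the literal parts of each hostname with re.escape so that
a '.' only matches a dot.

```python
import re

# Invented sample data, standing in for hosts already known to be dynamic.
dynamic_host_list = [
    "dsl-12-34.example.net",
    "ppp7.dialup.example.org",
]

number_expr = re.compile(r"\d+")

def host_to_pattern(host):
    # Escape the literal pieces, then let any run of digits stand in
    # for each number in the original hostname.
    parts = [re.escape(p) for p in number_expr.split(host)]
    return "^" + r"\d+".join(parts) + "$"

host_patterns = set(host_to_pattern(h) for h in dynamic_host_list)
host_expr = re.compile("|".join(sorted(host_patterns)))

def looks_dynamic(host):
    return host_expr.match(host) is not None

print(looks_dynamic("dsl-99-1.example.net"))     # same shape, new numbers
print(looks_dynamic("ppp7xdialup.example.org"))  # 'x' where a dot should be
print(looks_dynamic("mail.example.net"))         # shape never seen before
```

One thing to watch: if dynamic_host_list is empty, the joined pattern is
the empty string, which matches every hostname, so that case probably
needs its own check.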

For IP addresses, you really just need a mechanism to filter blocks of
IP addresses.  It might be easiest to first convert them into
fixed-width hex strings and then make liberal use of [0-9a-f] in
regular expressions.
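
A minimal sketch of the hex idea follows, assuming plain dotted-quad
IPv4 addresses. The helper names ip_to_hex and in_dynamic_block and the
two address blocks are hypothetical (the blocks are reserved
documentation ranges used only as placeholders).

```python
import re

def ip_to_hex(ip):
    """Turn dotted-quad '10.1.2.3' into 8 lowercase hex digits, '0a010203'."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or not all(0 <= o <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % ip)
    return "".join("%02x" % o for o in octets)

# Hypothetical block list:
#   192.0.2.0    - 192.0.2.255    ->  c00002 followed by any two hex digits
#   198.51.100.0 - 198.51.100.15  ->  c633640 followed by any one hex digit
dynamic_ip_expr = re.compile(r"^(?:c00002[0-9a-f]{2}|c633640[0-9a-f])$")

def in_dynamic_block(ip):
    return dynamic_ip_expr.match(ip_to_hex(ip)) is not None

print(ip_to_hex("10.1.2.3"))
print(in_dynamic_block("198.51.100.15"))
print(in_dynamic_block("198.51.100.16"))
```

Each hex digit covers four bits, so this works most cleanly for blocks
whose boundaries fall on 4-bit multiples; other block sizes need a
narrower character class such as [0-7] for the digit where the block
ends.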



