Help beautify ugly heuristic code

Mitja nun at example.com
Thu Dec 9 06:16:42 EST 2004


On Wed, 08 Dec 2004 16:09:43 -0500, Stuart D. Gathman <stuart at bmsi.com> wrote:

> I have a function that recognizes PTR records for dynamic IPs.  There is
> no hard and fast rule for this - every ISP does it differently, and may
> change their policy at any time, and use different conventions in
> different places.  Nevertheless, it is useful to apply stricter
> authentication standards to incoming email when the PTR for the IP
> indicates a dynamic IP (namely, the PTR record is ignored since it
> doesn't mean anything except to the ISP).  This is because Windoze
> Zombies are the favorite platform of spammers.

Here is roughly how I would do it. You'll have to experiment to find the
right point values for the different pattern matches, and maybe add some
extra criteria. I don't have the time for that right now, but I'd be
interested to know how much my code and yours differ in what they detect
(i.e. for which hostnames the return values are different).
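
One way to see where two heuristics disagree is to run both over the same
(ip, host) pairs and keep only the rows where the answers differ. A minimal
sketch of that idea: is_dynamic_a, is_dynamic_b and the sample pairs are
hypothetical stand-ins made up for illustration, not either of the real
functions.

```python
def disagreements(pairs, check_a, check_b):
    """Return (ip, host, a, b) for every pair the two checks classify differently."""
    out = []
    for ip, host in pairs:
        a = bool(check_a(host, ip))
        b = bool(check_b(host, ip))
        if a != b:
            out.append((ip, host, a, b))
    return out

# Hypothetical stand-ins for the two heuristics being compared.
def is_dynamic_a(host, ip):
    return 'dyn' in host.lower()

def is_dynamic_b(host, ip):
    return ip.split('.')[3] in host

# Made-up sample data (documentation address range).
samples = [
    ('192.0.2.17', 'dyn-17.example.net'),
    ('192.0.2.18', 'mail.example.org'),
    ('192.0.2.19', 'host19.example.com'),
]

for row in disagreements(samples, is_dynamic_a, is_dynamic_b):
    print(row)
```

Plugging the two real functions in for the stand-ins and feeding it the
sample file would give the list of hostnames worth looking at by hand.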

Hope the indentation makes it through alright.

#!/usr/bin/python

import re
reNum = re.compile(r'\d+')
reWord = re.compile(r'(?<=[^a-z])[a-z]+(?=[^a-z])|^[a-z]+(?=[^a-z])')
#words that imply a dynamic ip
dynWords = ('dial','dialup','dialin','adsl','dsl','dyn','dynamic')
#words that imply a static ip
staticWords = ('cable','static')

def isDynamic(host, ip):
   """
     Heuristically checks whether hostname is likely to represent
     a dynamic ip.
     Returns True or False.
   """

   #for easier matching
   ip=[int(p) for p in ip.split('.')]
   host=host.lower()

   #since it's heuristic, we'll give the hostname
   #(de)merits for every pattern it matches further on.
   #based on the value of these points, we'll decide whether
   #it's dynamic or not
   points=0

   #the ip numbers; finding those in the hostname speaks
   #for itself; also include hex and oct representations
   #lowest ip byte is even more suggestive, give extra points
   #for matching that
   for p in ip[:3]:
     #bytes 0, 1, 2
     #('%o' % p is the octal form; it is never an empty string,
     #so a zero byte cannot match every hostname)
     if (host.find(str(p)) != -1) or (host.find('%o' % p) != -1): points+=20
   #byte 3
   if (host.find(str(ip[3])) != -1) or (host.find('%o' % ip[3]) != -1):
     points+=60
   #it's hard to distinguish hex numbers from "normal"
   #chars, so we simplify it a bit and only search for
   #last two bytes of ip concatenated
   if host.find('%x%x' % (ip[2], ip[3])) != -1: points+=60

   #long, seemingly random serial numbers in the hostname are also a hint
   #search for all numbers and "award" points for longer ones
   for num in reNum.findall(host):
     points += min(len(num)**2,60)

   #substrings that are more than just a hint of a dynamic ip
   for word in reWord.findall(host):
     if word in dynWords: points+=30
     if word in staticWords: points-=30

   #debug output: show the score
   print('[[ %d ]]' % points)
   return points>80

if __name__=='__main__':
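   #try the first 50 lines of the sample file
   for line in open('dynip.samp').readlines()[:50]:
     #the first two whitespace-separated fields are ip and hostname;
     #anything after them (e.g. a DYN marker) is ignored
     (ip,host) = line.split()[:2]
     if host.find('.') != -1:
       print('%s %s %s' % (host, ip, ['','DYNAMIC'][isDynamic(host,ip)]))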
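
Since the point values and the threshold are the parts that need
experimenting with, it can help to keep them in one place and have the
scorer report which rules fired. A rough Python 3 sketch of that idea, not a
drop-in replacement: the names WEIGHTS, THRESHOLD and score_host are made up
here, the numbers are simply copied from the script above, the octal/hex
checks are left out for brevity, and the word regex is simplified to plain
runs of letters.

```python
import re

# Point values copied from the script above; tune them against real samples.
WEIGHTS = {'ip_byte': 20, 'last_byte': 60, 'dyn_word': 30, 'static_word': -30}
THRESHOLD = 80
MAX_NUM_POINTS = 60

DYN_WORDS = ('dial', 'dialup', 'dialin', 'adsl', 'dsl', 'dyn', 'dynamic')
STATIC_WORDS = ('cable', 'static')

def score_host(host, ip):
    """Return (points, reasons) for a hostname and its dotted-quad ip."""
    host = host.lower()
    # Normalise each byte to its plain decimal string (e.g. '010' -> '10').
    octets = [str(int(p)) for p in ip.split('.')]
    points = 0
    reasons = []

    def award(rule, amount, detail):
        nonlocal points
        points += amount
        reasons.append((rule, detail, amount))

    # Decimal ip bytes found anywhere in the hostname.
    for o in octets[:3]:
        if o in host:
            award('ip_byte', WEIGHTS['ip_byte'], o)
    if octets[3] in host:
        award('last_byte', WEIGHTS['last_byte'], octets[3])
    # Longer digit runs earn more points, capped at MAX_NUM_POINTS.
    for num in re.findall(r'\d+', host):
        award('number', min(len(num) ** 2, MAX_NUM_POINTS), num)
    # Whole words (maximal runs of letters) from the two word lists.
    for word in re.findall(r'[a-z]+', host):
        if word in DYN_WORDS:
            award('dyn_word', WEIGHTS['dyn_word'], word)
        if word in STATIC_WORDS:
            award('static_word', WEIGHTS['static_word'], word)
    return points, reasons

points, reasons = score_host('adsl-192-0-2-17.example.net', '192.0.2.17')
print(points, points > THRESHOLD)
for rule, detail, amount in reasons:
    print('%-12s %-6s %+d' % (rule, detail, amount))
```

The per-rule breakdown makes it easier to see which value to adjust when a
hostname ends up on the wrong side of the threshold.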


-- 
Mitja


