[Tutor] regex woes in finding an ip and GET string

Peter Otten __peter__ at web.de
Mon Jun 20 09:58:13 CEST 2011


Gerhardus Geldenhuis wrote:

> I am trying to write a small program that will scan my access.conf file
> and update iptables to block anyone looking for stuff that they are not
> supposed to.
> 
> The code:
> #!/usr/bin/python
> import sys
> import re
> 
> def extractoffendingip(filename):
>   f = open(filename,'r')
>   filecontents = f.read()
> #193.6.135.21 - - [11/Jun/2011:13:58:01 +0000] "GET
> /admin/pma/scripts/setup.php HTTP/1.1" 404 304 "-" "Mozilla/4.0
> (compatible; MSIE 6.0; Windows 98)"
>   tuples = re.findall(r'^(\d+\.\d+\.\d+\.\d+).*\"GET(.*)HTTP',
>   filecontents) 

If you want to process the whole file at once, you have to use the 
re.MULTILINE flag so that ^ matches at the start of every line instead of 
only at the start of the whole text:

    tuples = re.compile(r'...', re.MULTILINE).findall(filecontents)
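
To see what the flag changes, here is a small sketch. The first sample line 
is a shortened copy of the log line quoted above; the second line and the 
variable names are made up for illustration. Without re.MULTILINE only the 
first line can match, because ^ matches only at the very start of the string:

```python
import re

# First line: shortened from the quoted log entry. Second line: made up.
sample = (
    '193.6.135.21 - - [11/Jun/2011:13:58:01 +0000] '
    '"GET /admin/pma/scripts/setup.php HTTP/1.1" 404 304\n'
    '10.0.0.7 - - [11/Jun/2011:13:59:02 +0000] '
    '"GET /index.html HTTP/1.1" 200 512\n'
)

pattern = r'^(\d+\.\d+\.\d+\.\d+).*"GET(.*)HTTP'

# Default: ^ anchors at the start of the whole text only.
without_flag = re.findall(pattern, sample)

# re.MULTILINE: ^ also anchors right after every newline.
with_flag = re.compile(pattern, re.MULTILINE).findall(sample)

print(len(without_flag))  # number of (ip, getstring) tuples found
print(len(with_flag))
```

Note that the captured GET string keeps its surrounding spaces; call 
.strip() on it if you need the bare path.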

But I think it's better to process the file one line at a time.

>   iplist = []
    [snip]
>     if ip not in iplist:
>       iplist.append(ip)

So you want every unique ip to appear only once in iplist. Python offers an 
efficient data structure for that, the set. With these changes your function 
becomes something like (untested)

def extractoffendingips(filename):
    # Compile once; match() only tries at the start of each line.
    match = re.compile(r'^(\d+\.\d+\.\d+\.\d+).*\"GET(.*)HTTP').match
    ipset = set()  # a set silently ignores duplicate ips
    with open(filename) as f:
        for line in f:
            m = match(line)
            if m is not None:
                ip, getstring = m.groups()
                ipset.add(ip)
    for item in ipset:
        print(item)
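
If you would rather try the idea without a log file at hand, one possible 
variant (a sketch only; the name collect_ips and the sample lines are made 
up) takes any iterable of lines and returns the set instead of printing it:

```python
import re

# Same pattern as in the function above; group 1 is the ip.
LOG_LINE = re.compile(r'^(\d+\.\d+\.\d+\.\d+).*"GET(.*)HTTP')

def collect_ips(lines):
    """Return the set of ips found in lines that match LOG_LINE.

    Hypothetical helper, not part of the original program.
    """
    ips = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is not None:
            ips.add(m.group(1))
    return ips

# Made-up sample: the first ip occurs twice, the last line is not a GET.
sample_lines = [
    '193.6.135.21 - - [11/Jun/2011:13:58:01 +0000] '
    '"GET /admin/pma/scripts/setup.php HTTP/1.1" 404 304\n',
    '193.6.135.21 - - [11/Jun/2011:13:58:05 +0000] '
    '"GET /phpmyadmin/scripts/setup.php HTTP/1.1" 404 304\n',
    '10.0.0.7 - - [11/Jun/2011:13:59:02 +0000] '
    '"POST /login HTTP/1.1" 200 512\n',
]

for ip in sorted(collect_ips(sample_lines)):
    print(ip)
```

Since a file object iterates over its lines, you can pass an open file to 
collect_ips() directly.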



