[Tutor] regex woes in finding an ip and GET string

Gerhardus Geldenhuis gerhardus.geldenhuis at gmail.com
Sun Jun 19 13:25:07 CEST 2011


Hi,
I am trying to write a small program that will scan my Apache access log and
update iptables to block anyone requesting things they are not supposed
to.

The code:
#!/usr/bin/python
import sys
import re

def extractoffendingip(filename):
  with open(filename,'r') as f:
    filecontents = f.read()
  # Sample log line:
  # 193.6.135.21 - - [11/Jun/2011:13:58:01 +0000] "GET /admin/pma/scripts/setup.php HTTP/1.1" 404 304 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
  tuples = re.findall(r'^(\d+\.\d+\.\d+\.\d+).*\"GET(.*)HTTP', filecontents)
  iplist = []
  for items in tuples:
    (ip, getstring) = items
    print ip,getstring
    #print item
    if ip not in iplist:
      iplist.append(ip)
  for item in iplist:
    print item
  #ipmatch = re.search(r'', filecontents)

def main():
  extractoffendingip('access_log.1')

if __name__ == '__main__':
  main()

Log file: http://pastebin.com/F3RXDYBW


I could probably have used ranges to be more correct about matching IPs, but
I assumed that Apache would take care of that. I am assuming a level of
integrity in the log file with regard to its data.

The first problem I ran into was that I added a ^ to my search string:
re.findall(r'^(\d+\.\d+\.\d+\.\d+).*\"GET(.*)HTTP', filecontents)

but that finds only two results, far fewer than I am expecting. I am a
little confused: at first I thought it might be because the text I am
searching is now one single string, given the way I load the file, in which
case the ^ should find only one instance, but instead it finds two?
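
A minimal sketch of the ^ behaviour described above, assuming a two-line
sample in which the second line (10.0.0.7) is made up for illustration and
is not taken from the real log. With default flags, ^ matches only at the
very start of the string, so this pattern can return at most one match; with
re.MULTILINE it also matches after every newline. One possible, unconfirmed
reason for seeing two lines of output is that the script prints a match once
in each of its two loops.

```python
import re

# Hypothetical two-line sample; only the first line comes from the post.
SAMPLE = (
    '193.6.135.21 - - [11/Jun/2011:13:58:01 +0000] '
    '"GET /admin/pma/scripts/setup.php HTTP/1.1" 404 304 "-" "Mozilla/4.0"\n'
    '10.0.0.7 - - [11/Jun/2011:13:59:12 +0000] '
    '"GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/4.0"\n'
)

PATTERN = r'^(\d+\.\d+\.\d+\.\d+).*\"GET(.*)HTTP'

# Default flags: ^ matches only at the very start of the whole string.
default_matches = re.findall(PATTERN, SAMPLE)

# re.MULTILINE: ^ also matches immediately after each newline.
multiline_matches = re.findall(PATTERN, SAMPLE, re.MULTILINE)

print(len(default_matches))
print(len(multiline_matches))
```

Reading the file line by line and calling re.search on each line would be
another way to get one match per log entry.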

Removing the ^ works much better. I now get mostly correct results, but I
also get a number of IPs with an empty GET string, although I thought there
should be only one such entry in the log file. I would really appreciate any
pointers as to what is going on here.
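
A rough diagnostic sketch for the empty GET strings, not a fix. It assumes
the same pattern without the ^ and uses a hypothetical helper,
blank_get_lines, plus made-up sample lines. It reports the line number and
the repr() of the captured text, which makes whitespace-only captures
visible and shows which log lines produce them.

```python
import re

PATTERN = re.compile(r'(\d+\.\d+\.\d+\.\d+).*\"GET(.*)HTTP')

def blank_get_lines(lines):
    """Return (line number, ip, repr of capture) where the GET capture is blank."""
    hits = []
    for number, line in enumerate(lines, 1):
        match = PATTERN.search(line)
        # strip() so that a capture holding only whitespace counts as blank
        if match and not match.group(2).strip():
            hits.append((number, match.group(1), repr(match.group(2))))
    return hits

# Hypothetical sample; the second line has no path between GET and HTTP.
sample = [
    '193.6.135.21 - - [11/Jun/2011:13:58:01 +0000] '
    '"GET /admin/pma/scripts/setup.php HTTP/1.1" 404 304',
    '10.0.0.7 - - [11/Jun/2011:13:59:12 +0000] "GET HTTP/1.1" 400 226',
]

for hit in blank_get_lines(sample):
    print(hit)
```

Passing an open file object (for example the access_log.1 file used in the
script) instead of the sample list should work the same way, since the
helper only iterates over lines.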

Regards

-- 
Gerhardus Geldenhuis