[Tutor] newbie confused about text parsing

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Sun Jul 13 20:52:02 2003


On Sun, 13 Jul 2003, Chris Johnson wrote:

>     I'm a unix administrator and I want to learn python to help in my
> job. I thought parsing a log file would be a good start but I'm stuck on
> how to start.
>
> I'm working with a firewall log file the contents of which look something
> like this.
>   Nov 30 00:58:05 firewall kernel: Shorewall:man1918:DROP:IN=eth0 OUT=
> MAC=ff:ff:ff:ff:ff:ff:00:90:f5:1e:15:aa:08:00 SRC=10.1.2.27 DST=10.1.2.255
> LEN=96 TOS=0x00 PREC=0x00 TTL=128 ID=4853 PROTO=UDP SPT=137 DPT=137 LEN=76

Hi Chris,

Ok, sounds like an interesting project!  Shorewall is based on the Linux
'iptables' system,

    http://www.shorewall.net/

so if you can find a log parser that handles iptables's log format, you
may be able to successfully use it for Shorewall.


Anyway, let's see what we can help with, assuming that there's no
third-party module out there yet.


> I want to loop through log file looking for a string (Shorewall)

This part shouldn't be too hard: as we're looping through the log, we can
look for a "substring" by using each line's find() method.  Lines that
don't have 'Shorewall' should be skipped, and lines that do have it will
need further parsing.  Here's some sample code that says this more
formally:

###
def hasShorewall(s):
    "Returns true if the string 's' has the word 'Shorewall' in it."
    ## find() is case-sensitive, so this must match the log's spelling.
    return s.find('Shorewall') != -1

logfile = open("/var/log/messages")
for line in logfile:
    if hasShorewall(line):
        doSomeMoreParsing(line)    ## placeholder for the parsing step
###
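
The loop above stops when it reaches the end of the file.  If the script
should keep running and pick up new entries as the firewall writes them,
one possible approach is to keep calling readline() and sleep for a moment
whenever nothing new has shown up.  This is only a rough sketch: the names
follow(), shorewall_lines() and poll_seconds are made up for illustration,
and it assumes the log is never rotated and that every entry is written as
one complete line.

```python
import time

def follow(logfile, poll_seconds=1.0):
    """Yield lines from 'logfile' forever, waiting for new ones at the end."""
    while True:
        line = logfile.readline()
        if line:
            yield line
        else:
            # Nothing new yet: pause briefly, then look again.
            time.sleep(poll_seconds)

def shorewall_lines(lines):
    """Yield only the lines that mention Shorewall."""
    for line in lines:
        if line.find('Shorewall') != -1:
            yield line

# Usage (this loop never finishes on its own; stop it with Ctrl-C).
# doSomeMoreParsing() is the same placeholder as in the code above.
#
#     logfile = open("/var/log/messages")
#     for line in shorewall_lines(follow(logfile)):
#         doSomeMoreParsing(line)
```

Both functions are generators, so the filtering and the waiting stay
separate from whatever ends up being done with each line.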


> then parse each matching line into a dictionary which I can sort or put
> into database fields.
>
> I've been reading the documentation on both modules re and string but
> which do I use. I'd like to run this script all the time so entries are
> added in near real time to the database.

Regular expressions sound like a good fit for this project.  It's very
likely that you'll need to write a regular expression to extract certain
patterns from the log file.

A.M. Kuchling's written a pretty nice "Python Regular Expression HOWTO"
that's a brief tutorial about regular expressions:

    http://www.amk.ca/python/howto/regex/regex.html


Let's take a look again at that log file line:

>   Nov 30 00:58:05 firewall kernel: Shorewall:man1918:DROP:IN=eth0 OUT=
> MAC=ff:ff:ff:ff:ff:ff:00:90:f5:1e:15:aa:08:00 SRC=10.1.2.27 DST=10.1.2.255
> LEN=96 TOS=0x00 PREC=0x00 TTL=128 ID=4853 PROTO=UDP SPT=137 DPT=137 LEN=76

There appears to be a fairly consistent pattern here to the field-value
pairs.  There's an uppercased "field name", followed by an equal sign '=',
and then the field value.  In regular expression terms, we'd say that
we're looking for:

    [A-Z]+           ## A bunch of uppercased letters, the "field name"
    =                ## followed by the equal sign
    \S+              ## and then the field value.  I'll guess that
                     ## this should be a run of "nonspace"
                     ## characters.

In regular expression syntax, the plus sign means "one or more of the
preceding kind of character".  Now, I have to admit that the above code is
a complete hack: I have no clue if it'll capture all Shorewall log
messages properly.

We can try this out, though, and see how well it works:

###
>>> import re
>>> regex = re.compile(r'''     [A-Z]+
...                             =
...                             \S+''', re.VERBOSE)
>>> s = '''Nov 30 00:58:05 firewall kernel: Shorewall:man1918:DROP:
...        IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:90:f5:1e:15:aa:08:00
...        SRC=10.1.2.27 DST=10.1.2.255 LEN=96 TOS=0x00 PREC=0x00
...        TTL=128 ID=4853 PROTO=UDP SPT=137 DPT=137 LEN=76'''
>>> regex.findall(s)
['IN=eth0', 'MAC=ff:ff:ff:ff:ff:ff:00:90:f5:1e:15:aa:08:00',
 'SRC=10.1.2.27', 'DST=10.1.2.255', 'LEN=96', 'TOS=0x00', 'PREC=0x00',
 'TTL=128', 'ID=4853', 'PROTO=UDP', 'SPT=137', 'DPT=137', 'LEN=76']
###

Looks sorta decent.  This should get you started.  *grin*
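
To get from that list to the dictionary you asked about, one option is to
add parentheses to the pattern so that findall() hands back (name, value)
pairs.  Here's a minimal sketch.  The name parse_fields() is made up, and
two things are my own assumptions: LEN appears twice in the sample line, so
I keep the first value and ignore the later one; and a field with an empty
value, like 'OUT=', doesn't match the pattern and is simply left out.

```python
import re

# Same pattern as above, with parentheses added so that each match
# comes back as a (name, value) pair instead of one string.
FIELD = re.compile(r'([A-Z]+)=(\S+)')

def parse_fields(line):
    """Return a dictionary of the NAME=value pairs found in 'line'."""
    fields = {}
    for name, value in FIELD.findall(line):
        # Assumption: when a name repeats (LEN shows up twice in the
        # sample), keep the first value and ignore the later one.
        fields.setdefault(name, value)
    return fields

line = ('Nov 30 00:58:05 firewall kernel: Shorewall:man1918:DROP:'
        'IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:90:f5:1e:15:aa:08:00 '
        'SRC=10.1.2.27 DST=10.1.2.255 LEN=96 TOS=0x00 PREC=0x00 '
        'TTL=128 ID=4853 PROTO=UDP SPT=137 DPT=137 LEN=76')
fields = parse_fields(line)
print(fields['SRC'], fields['DPT'], fields['LEN'])   # 10.1.2.27 137 96
```

All the values come back as strings, so something like the port numbers
would still need int() before being sorted numerically or stored in a
numeric database column.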


But should we reinvent the wheel?  Let's see... there do appear to be a
few iptables parsers in Perl:

    http://caspian.dotconf.net/menu/Software/ScanAlert/
    http://www.dshield.org/framework.php

I haven't found any iptables parsers in Python yet.  You may want to ask
on the comp.lang.python newsgroup to see if anyone has one already cooked
up.  If you'd like, we can look at one of the Perl ones, and see how one
might port the code into Python.


Good luck to you!