Logfile analysing with pyparsing

Andi Clemens andi.clemens at gmx.net
Tue Sep 26 01:21:59 EDT 2006


Hi,

We have had some problems with our mailserver over the last few weeks.
Some messages were not delivered and we wanted to know why.
But looking through the logfile by hand is a time-consuming process.
So I wanted to write a parser that analyses the logs and converts them to XML.

But I have never written a parser before, and now I'm sitting in front 
of the logfile trying to write the grammar for pyparsing.

First of all I need to know if it is possible to parse that kind of info 
into XML.
Here is an excerpt of the logfile lines I'm interested in:

Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301: 
message-id=<200609180214.k8I2EuNo016264 at mforward2.dtag.de>
Sep 18 04:15:22 mailrelay spamd[1364]: spamd: processing message 
<200609180214.k8I2EuNo016264 at mforward2.dtag.de> for nobody:65534
Sep 18 04:15:25 mailrelay spamd[1364]: spamd: result: Y 15 - 
BAYES_99,DATE_IN_PAST_03_06,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_DSN,DNS_FROM_RFC_POST,DNS_FROM_RFC_WHOIS,FORGED_MUA_OUTLOOK,SPF_SOFTFAIL 
scantime=3.1,size=8086,user=nobody,uid=65534,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=55277,mid=<200609180214.k8I2EuNo016264 at mforward2.dtag.de>,bayes=1,autolearn=no 

Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E: 
message-id=<200609180214.k8I2EuNo016264 at mforward2.dtag.de>
Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD: 
message-id=<200609180214.k8I2EuNo016264 at mforward2.dtag.de>
Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD: 
to=<SPAM-FOUND at OUR-MAILSERVER.mail.com>, relay=10.49.0.7[10.49.0.7], 
delay=1, status=sent (250 2.6.0 
<200609180214.k8I2EuNo016264 at mforward2.dtag.de> Queued mail for delivery)

They are filtered by "message-id", so all these lines above have 
something to do with the message 
"200609180214.k8I2EuNo016264 at mforward2.dtag.de".
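For illustration, a pyparsing grammar for one such line might look roughly like the sketch below. It assumes that every entry sits on a single physical line in the real logfile (the excerpt above is wrapped by the mail client), that queue IDs are upper-case hex strings as in the excerpt, and that pyparsing 3.x is installed (older releases spell the methods parseString and asDict). The name parse_line is made up for this example.

```python
from pyparsing import Optional, Regex, Suppress, Word, alphanums, nums

# "Sep 18 04:15:22"; syslog pads single-digit days with an extra space
timestamp = Regex(r"[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}")("timestamp")
host = Word(alphanums + "-.")("host")
# "postfix/cleanup[12103]:" or "spamd[1364]:"
process = Word(alphanums + "/_-")("process")
pid = Suppress("[") + Word(nums)("pid") + Suppress("]:")
# Assumption: queue IDs are upper-case hex; spamd lines carry none
queue_id = Word("0123456789ABCDEF", min=5)("queue_id") + Suppress(":")
# Everything after the prefix is kept as one string for later inspection
message = Regex(r".+")("message")

log_line = timestamp + host + process + pid + Optional(queue_id) + message


def parse_line(line):
    """Return the named fields of one log line as a plain dict."""
    return log_line.parse_string(line).as_dict()
```

Calling parse_line on a postfix line gives a dict with the keys timestamp, host, process, pid, queue_id and message; for a spamd line the queue_id key is simply absent.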

The original logfile is about 25 MB, so I can't post all of the 
lines of course ;-)

Looking at these lines I realized that there are "Queue IDs":
755387301
DA1431965E
EF90720AD

Filtering the log for these IDs results in the following lines:

Sep 18 02:15:11 mailrelay postfix/smtpd[10841]: 755387301: 
client=unknown[194.25.242.123]
Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301: 
message-id=<200609180214.k8I2EuNo016264 at mforward2.dtag.de>
Sep 18 04:15:22 mailrelay postfix/qmgr[11082]: 755387301: 
from=<sender at mail.net.mx>, size=8152, nrcpt=7 (queue active)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301: 
to=<receiver1 at mail.com>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301: 
to=<receiver2 at mail.com>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301: 
to=<receiver3 at mail.com>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301: 
to=<receiver4 at mail.com>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: 755387301: removed

Sep 18 04:15:25 mailrelay postfix/pickup[13175]: DA1431965E: uid=65534 
from=<nobody>
Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E: 
message-id=<200609180214.k8I2EuNo016264 at mforward2.dtag.de>
Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: DA1431965E: 
from=<nobody at OUR-MAILSERVER.mail.com>, size=11074, nrcpt=1 (queue active)
Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E: 
to=<SPAM-FOUND at OUR-MAILSERVER.mail.com>, relay=localhost[127.0.0.1], 
delay=1, status=sent (250 Ok: queued as EF90720AD)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: DA1431965E: removed

Sep 18 04:15:25 mailrelay postfix/smtpd[11704]: EF90720AD: 
client=localhost[127.0.0.1]
Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD: 
message-id=<200609180214.k8I2EuNo016264 at mforward2.dtag.de>
Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E: 
to=<SPAM-FOUND at OUR-MAILSERVER.mail.com>, relay=localhost[127.0.0.1], 
delay=1, status=sent (250 Ok: queued as EF90720AD)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD: 
from=<nobody at OUR-MAILSERVER.mail.com>, size=11263, nrcpt=1 (queue active)
Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD: 
to=<SPAM-FOUND at OUR-MAILSERVER.mail.com>, relay=10.49.0.7[10.49.0.7], 
delay=1, status=sent (250 2.6.0 
<200609180214.k8I2EuNo016264 at mforward2.dtag.de> Queued mail for delivery)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD: removed

All of this was done by hand on the command line with grep...
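The two grep steps above could perhaps be folded into a single pass along the lines of the following sketch. It uses the standard library's re module rather than pyparsing to stay short, assumes one log entry per physical line, and assumes queue IDs are upper-case hex. The names QUEUE_LINE, MESSAGE_ID and group_by_message are invented for this example.

```python
import re
from collections import defaultdict

# Matches "postfix/<daemon>[pid]: <queue ID>: <rest>"
QUEUE_LINE = re.compile(r"postfix/\w+\[\d+\]: (?P<qid>[0-9A-F]{5,}): (?P<rest>.*)")
MESSAGE_ID = re.compile(r"message-id=<(?P<mid>[^>]+)>")


def group_by_message(lines):
    """Read the lines once; return {message-id: {queue ID: [log lines]}}."""
    lines_by_qid = defaultdict(list)  # queue ID -> all of its log lines
    mid_by_qid = {}                   # queue ID -> message-id it belongs to
    for line in lines:
        m = QUEUE_LINE.search(line)
        if m is None:
            continue  # not a postfix line with a queue ID (e.g. spamd)
        qid = m.group("qid")
        lines_by_qid[qid].append(line.rstrip("\n"))
        found = MESSAGE_ID.search(m.group("rest"))
        if found:
            mid_by_qid[qid] = found.group("mid")
    # Join the two tables only after the whole file has been read, because
    # a queue ID can show up (client=...) before its message-id line does.
    messages = defaultdict(dict)
    for qid, mid in mid_by_qid.items():
        messages[mid][qid] = lines_by_qid[qid]
    return dict(messages)
```

Since a file object yields lines lazily, group_by_message(open("mail.log")) would read the file only once (the filename is a placeholder). The per-queue-ID lists still grow with the size of the log, though.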

Is it possible to parse this big logfile only ONCE and extract all this 
info into XML?

Like this:

<message id="200609180214.k8I2EuNo016264 at mforward2.dtag.de">
   <timestamp>Sep 18 04:15:26</timestamp>
   <from>sender at mail.net.mx</from>
   <to>receiver1 at mail.com</to>
   <to>receiver2 at mail.com</to>
   <to>receiver3 at mail.com</to>
   <to>receiver4 at mail.com</to>
   <queueID>EF90720AD</queueID>
   <queueID>DA1431965E</queueID>
   <queueID>755387301</queueID>
   <spamd>
      <score>15</score>
      <filtered>yes</filtered>
      <sendto>SPAM-FOUND at OUR-MAILSERVER.mail.com</sendto>
   </spamd>
</message>
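Producing XML of this shape from already collected data could be done with the standard library's xml.etree.ElementTree, roughly as sketched here. The element names are taken from the example above. The function name message_to_xml and the layout of the input dict are assumptions made up for this sketch.

```python
import xml.etree.ElementTree as ET


def message_to_xml(msg):
    """Build one <message> element from a dict and return it as a string."""
    root = ET.Element("message", {"id": msg["id"]})
    ET.SubElement(root, "timestamp").text = msg["timestamp"]
    ET.SubElement(root, "from").text = msg["from"]
    for recipient in msg["to"]:
        ET.SubElement(root, "to").text = recipient
    for qid in msg["queue_ids"]:
        ET.SubElement(root, "queueID").text = qid
    if "spamd" in msg:
        spamd = ET.SubElement(root, "spamd")
        ET.SubElement(spamd, "score").text = str(msg["spamd"]["score"])
        # The example output spells the flag as yes/no
        flag = "yes" if msg["spamd"]["filtered"] else "no"
        ET.SubElement(spamd, "filtered").text = flag
        ET.SubElement(spamd, "sendto").text = msg["spamd"]["sendto"]
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")
```

ElementTree escapes special characters in text and attribute values itself, so the addresses can be passed in unmodified.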

The goal of this is to provide a web interface where we can see whether 
messages were filtered as spam (or deleted by our virus scanner).
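The spam verdict itself appears to be in the "spamd: result:" line, so extracting it might look like the following sketch. It again uses the standard library's re module, assumes the field layout seen in the excerpt (flag, score, a dash, comma-separated test names, then comma-separated key=value pairs), and treats any flag other than "Y" as not spam. The names SPAMD_RESULT and parse_spamd_result are hypothetical.

```python
import re

# Layout assumed from the excerpt:
# "spamd: result: Y 15 - TEST_A,TEST_B key=value,key=value,..."
SPAMD_RESULT = re.compile(
    r"spamd: result: (?P<flag>\S) (?P<score>-?\d+) - "
    r"(?P<tests>\S*)\s+(?P<fields>.*)"
)


def parse_spamd_result(line):
    """Return verdict, score, tests and key=value fields, or None."""
    m = SPAMD_RESULT.search(line)
    if m is None:
        return None  # not a spamd result line
    # Values are assumed to contain neither commas nor whitespace
    fields = dict(re.findall(r"(\w+)=([^,\s]+)", m.group("fields")))
    return {
        "is_spam": m.group("flag") == "Y",
        "score": int(m.group("score")),
        "tests": [t for t in m.group("tests").split(",") if t],
        "fields": fields,
    }
```

The "mid" entry in the returned fields carries the message-id, which is what would tie a spamd verdict back to the postfix lines for the same message.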

Is it possible? Or do I have to scan / parse the file more than once?

Andi

-- 
Mozilla Thunderbird 1.5.0.7
Arch Linux


