[Tutor] Logfile Manipulation

Stephen Nelson-Smith sanelson at gmail.com
Mon Nov 9 14:46:30 CET 2009


And the problem I have with the below is that I've discovered that the
input logfiles aren't strictly ordered - i.e. there is variance of a
second or so in some of the entries.
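Since entries are only displaced by a second or so, a full sort isn't strictly necessary: a small reorder buffer (a min-heap holding the next N lines, always emitting the smallest) restores order in one streaming pass. A minimal sketch - the `window` size and `key` function here are assumptions, not anything from the original code:

```python
import heapq

def reorder(lines, key, window=100):
    """Yield items in sorted order, assuming no item sits more than
    `window` positions away from its correct sorted position."""
    heap = []
    for item in lines:
        heapq.heappush(heap, (key(item), item))
        if len(heap) > window:
            # With the buffer full, the smallest buffered item can no
            # longer be overtaken by anything still unread.
            yield heapq.heappop(heap)[1]
    while heap:
        yield heapq.heappop(heap)[1]
```

A window of a few hundred lines should comfortably cover a one-second wobble at typical access-log rates, and memory stays bounded regardless of file size.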

I can sort the biggest logfile (800M) using Unix sort in about 1.5
minutes on my workstation.  That's not really fast enough, with
potentially 12 other files....
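One way to avoid paying for a full external sort per file is to merge the files lazily: heapq.merge (Python 2.6+) takes any number of iterators that are each already sorted and yields their combined entries in order, reading one entry per input at a time. A sketch, assuming each input yields (timestamp, line) pairs:

```python
import heapq

def merged(*iterables):
    """Merge already-sorted (timestamp, line) iterables into a single
    ordered stream of lines; heapq.merge keeps only one pending entry
    per input, so memory use is constant regardless of file size."""
    for stamp, line in heapq.merge(*iterables):
        yield line
```

Each input must itself be sorted for this to be correct, so the slightly-shuffled files would need a small tidy-up step first - but that's a streaming fix, not a 1.5-minute sort.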

Hrm...

S.

On Mon, Nov 9, 2009 at 1:35 PM, Stephen Nelson-Smith <sanelson at gmail.com> wrote:
> Hi,
>
>> If you create iterators from the files that yield (timestamp, entry)
>> pairs, you can merge the iterators using one of these recipes:
>> http://code.activestate.com/recipes/491285/
>> http://code.activestate.com/recipes/535160/
>
> Could you show me how I might do that?
>
> So far I'm at the stage of being able to produce loglines:
>
> #! /usr/bin/env python
> import gzip
>
> class LogFile:
>     def __init__(self, filename, date):
>         self.f = gzip.open(filename, "r")
>         # Skip ahead to the first entry for the requested date.
>         for logline in self.f:
>             self.line = logline
>             self.stamp = " ".join(self.line.split()[3:5])
>             if self.stamp.startswith(date):
>                 break
>
>     def getline(self):
>         ret = self.line
>         self.line = self.f.readline()
>         # An empty string means EOF; flag it so the merge loop can stop.
>         self.stamp = " ".join(self.line.split()[3:5]) if self.line else None
>         return ret
>
> logs = [LogFile("a/access_log-20091105.gz", "[05/Nov/2009"),
>         LogFile("b/access_log-20091105.gz", "[05/Nov/2009"),
>         LogFile("c/access_log-20091105.gz", "[05/Nov/2009")]
>
> while True:
>     live = [x for x in logs if x.stamp is not None]
>     if not live:
>         break
>     # Emit the entry with the earliest timestamp across all logs.
>     # Using key= avoids comparing LogFile objects when stamps tie.
>     print min(live, key=lambda x: x.stamp).getline()
>
>
> --
> Stephen Nelson-Smith
> Technical Director
> Atalanta Systems Ltd
> www.atalanta-systems.com
>
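For the (timestamp, entry) pairs those merge recipes expect, a small generator over each file would do the stamp extraction. This is only a sketch, assuming Apache common log format, where whitespace fields 3-4 (the [3:5] slice above) hold "[dd/Mon/yyyy:HH:MM:SS +zzzz]"; note these textual stamps only sort correctly within a single day, since "Nov" and "Dec" don't compare chronologically:

```python
def stamped(lines):
    """Pair each log line with its bracketed timestamp so the pairs
    can be fed to a merging recipe such as heapq.merge."""
    for line in lines:
        yield " ".join(line.split()[3:5]), line
```

Wrapping each open file this way would replace the hand-maintained self.stamp bookkeeping in the LogFile class.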



-- 
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com

