[Tutor] Logfile multiplexing

Kent Johnson kent37 at tds.net
Tue Nov 10 13:49:55 CET 2009


On Tue, Nov 10, 2009 at 5:04 AM, Stephen Nelson-Smith
<sanelson at gmail.com> wrote:
> I have the following idea for multiplexing logfiles (ultimately into heapq):
>
> import gzip
>
> class LogFile:
>    def __init__(self, filename, date):
>        self.logfile = gzip.open(filename, 'r')
>        for logline in self.logfile:
>            self.line = logline
>            self.stamp = self.timestamp(self.line)
>            if self.stamp.startswith(date):
>                break
>
>    def timestamp(self, line):
>        return " ".join(self.line.split()[3:5])
>
>    def getline(self):
>        nextline = self.line
>        self.line = self.logfile.readline()
>        self.stamp = self.timestamp(self.line)
>        return nextline

One error is that the initial line found in __init__() will be the same as
the first line returned by getline(), so a caller has to use getline() from
the start rather than reading self.line directly, or the first line is seen
twice. Also, you may need to filter every line, not just skip ahead to the
first match - what if there is jitter around midnight, or the log rolls over
before the end of the day?
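
For example (a sketch - the filename and a date string that actually matches
the data are assumptions here), the duplication looks like this:

l = LogFile("log1", "[05/Nov/2009")
first = l.line        # first matching line, found by __init__()
again = l.getline()   # getline() returns that same line before advancing
assert first == again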

More important, though, you are pretty much writing your own iterator
without using the iterator protocol. I would write this as:

import gzip

class LogFile:
    def __init__(self, filename, date):
        self.logfile = gzip.open(filename, 'r')
        self.date = date

    def __iter__(self):
        # Generator implementing the iterator protocol: yield only the
        # lines whose timestamp starts with the requested date.
        for logline in self.logfile:
            stamp = self.timestamp(logline)
            if stamp.startswith(self.date):
                yield (stamp, logline)

    def timestamp(self, line):
        # Fields 3 and 4 (zero-based) of an access-log line hold the
        # timestamp, e.g. '[05/Nov/2009:04:02:07 +0000]'.
        return " ".join(line.split()[3:5])


Then you could use it like this:
l = LogFile("log1", "[Nov/05/2009")
for stamp, line in l:
    print stamp, line

or use one of the merging recipes I pointed to earlier.

> The idea is that I can then do:
>
> logs = [("log1", "[Nov/05/2009"), ("log2", "[Nov/05/2009"), ("log3",
> "[Nov/05/2009"), ("log4", "[Nov/05/2009")]

or

logs = [LogFile("log1", "[Nov/05/2009"), LogFile("log2", "[Nov/05/2009"),
        LogFile("log3", "[Nov/05/2009"), LogFile("log4", "[Nov/05/2009")]

> I've tested it with one log (15M compressed, 211M uncompressed), and
> it takes about 20 seconds to be ready to roll.
>
> However, then I get unexpected behaviour:
>
> ~/system/tools/magpie $ python
> Python 2.4.3 (#1, Jan 21 2009, 01:11:33)
> [GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import magpie
> >>> magpie.l
> <magpie.LogFile instance at 0x2b8045765bd8>
> >>> magpie.l.stamp
> '[05/Nov/2009:04:02:07 +0000]'
> >>> magpie.l.getline()
> 89.151.119.195 - - [05/Nov/2009:04:02:07 +0000] "GET
> /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
> HTTP/1.1" 200 50 "-" "-"
>
> '89.151.119.195 - - [05/Nov/2009:04:02:07 +0000] "GET
> /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
> HTTP/1.1" 200 50 "-" "-"\n'
> >>> magpie.l.stamp
> ''
> >>> magpie.l.getline()
>
> ''
> >>>
>
> I expected to be able to call getline() and get more lines...

You are reading through the entire file on load because your timestamp
check never succeeds: you are filtering out the whole file and ending up
holding just the last line. Check the date you are supplying against the
actual data - they don't match.
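
A quick check makes the mismatch obvious (the stamp is copied from your
session above):

stamp = '[05/Nov/2009:04:02:07 +0000]'    # what the data contains
date = '[Nov/05/2009'                     # what you are passing in
print stamp.startswith(date)              # False - day and month are swapped
print stamp.startswith('[05/Nov/2009')    # True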

Kent

