[Tutor] Logfile multiplexing
Kent Johnson
kent37 at tds.net
Tue Nov 10 13:49:55 CET 2009
On Tue, Nov 10, 2009 at 5:04 AM, Stephen Nelson-Smith
<sanelson at gmail.com> wrote:
> I have the following idea for multiplexing logfiles (ultimately into heapq):
>
> import gzip
>
> class LogFile:
>     def __init__(self, filename, date):
>         self.logfile = gzip.open(filename, 'r')
>         for logline in self.logfile:
>             self.line = logline
>             self.stamp = self.timestamp(self.line)
>             if self.stamp.startswith(date):
>                 break
>
>     def timestamp(self, line):
>         return " ".join(self.line.split()[3:5])
>
>     def getline(self):
>         nextline = self.line
>         self.line = self.logfile.readline()
>         self.stamp = self.timestamp(self.line)
>         return nextline
One error is that the line stored by __init__ is the same line that
the first call to getline() returns, so you should call getline()
before trying to access a line. You may also need to filter every
line rather than stopping at the first match: what if there is jitter
at midnight, or the log rolls over before the end?
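To make the filtering point concrete, here is a minimal sketch. The sample lines are made up for illustration (they are not from the real logs), and the module-level timestamp() helper simply mirrors the method above. It compares "skip ahead to the first matching line, then take everything" with "test every line":

```python
from itertools import dropwhile

def timestamp(line):
    # Fields 3 and 4 of a whitespace-split line hold "[date:time" and "zone]".
    return " ".join(line.split()[3:5])

# Hypothetical sample data: one late write from the previous day,
# and one line from the following day at the end of the file.
lines = [
    'h1 - - [04/Nov/2009:23:59:59 +0000] "GET /a"',
    'h1 - - [05/Nov/2009:00:00:00 +0000] "GET /b"',
    'h1 - - [04/Nov/2009:23:59:59 +0000] "GET /c"',
    'h1 - - [05/Nov/2009:00:00:01 +0000] "GET /d"',
    'h1 - - [06/Nov/2009:00:00:00 +0000] "GET /e"',
]
date = "[05/Nov/2009"

def matches(line):
    return timestamp(line).startswith(date)

# Stop filtering at the first match: later lines are taken unchecked.
skip_to_first = list(dropwhile(lambda l: not matches(l), lines))

# Check every line: only lines stamped with the requested date survive.
filter_every = [l for l in lines if matches(l)]
```

With this data the first approach lets the stray 04/Nov and 06/Nov lines through, while the second keeps only the two 05/Nov lines.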
More important, though, you are pretty much writing your own iterator
without using the iterator protocol. I would write this as:
import gzip

class LogFile:
    def __init__(self, filename, date):
        self.logfile = gzip.open(filename, 'r')
        self.date = date

    def __iter__(self):
        # Yield only the lines whose timestamp starts with the wanted date.
        for logline in self.logfile:
            stamp = self.timestamp(logline)
            if stamp.startswith(self.date):
                yield (stamp, logline)

    def timestamp(self, line):
        # Fields 3 and 4 hold the bracketed date/time and the zone offset.
        return " ".join(line.split()[3:5])
Then you could use this as
l = LogFile("log1", "[Nov/05/2009")
for stamp, line in l:
    print stamp, line
or use one of the merging recipes I pointed to earlier.
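As one illustration of the merging step (not necessarily the recipe referred to above), here is a sketch using heapq.merge from the standard library. Note that heapq.merge was added in Python 2.6, so it is not available on 2.4.3. The stamped() generator and the in-memory sample lines are assumptions standing in for LogFile objects reading real gzip files, and comparing the stamps as plain strings only orders them correctly within a single day:

```python
import heapq

def timestamp(line):
    return " ".join(line.split()[3:5])

def stamped(lines, date):
    # Stand-in for LogFile.__iter__: yields (stamp, line) for matching lines.
    for line in lines:
        stamp = timestamp(line)
        if stamp.startswith(date):
            yield (stamp, line)

# Hypothetical sample data; each source is already in time order.
log1 = [
    '10.0.0.1 - - [05/Nov/2009:04:02:07 +0000] "GET /a HTTP/1.1" 200 50',
    '10.0.0.1 - - [05/Nov/2009:04:02:09 +0000] "GET /c HTTP/1.1" 200 50',
]
log2 = [
    '10.0.0.2 - - [05/Nov/2009:04:02:08 +0000] "GET /b HTTP/1.1" 200 50',
    '10.0.0.2 - - [06/Nov/2009:00:00:01 +0000] "GET /d HTTP/1.1" 200 50',
]

date = "[05/Nov/2009"
# heapq.merge lazily interleaves already-sorted iterables in sorted order.
merged = list(heapq.merge(stamped(log1, date), stamped(log2, date)))
paths = [line.split()[6] for stamp, line in merged]
```

Because the merge is lazy, only one pending line per log needs to be held in memory at a time.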
> The idea is that I can then do:
>
> logs = [("log1", "[Nov/05/2009"), ("log2", "[Nov/05/2009"), ("log3",
> "[Nov/05/2009"), ("log4", "[Nov/05/2009")]
or
logs = [LogFile("log1", "[Nov/05/2009"),
        LogFile("log2", "[Nov/05/2009"),
        LogFile("log3", "[Nov/05/2009"),
        LogFile("log4", "[Nov/05/2009")]
> I've tested it with one log (15M compressed, 211M uncompressed), and
> it takes about 20 seconds to be ready to roll.
>
> However, then I get unexpected behaviour:
>
> ~/system/tools/magpie $ python
> Python 2.4.3 (#1, Jan 21 2009, 01:11:33)
> [GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import magpie
> >>> magpie.l
> <magpie.LogFile instance at 0x2b8045765bd8>
> >>> magpie.l.stamp
> '[05/Nov/2009:04:02:07 +0000]'
> >>> magpie.l.getline()
> 89.151.119.195 - - [05/Nov/2009:04:02:07 +0000] "GET
> /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
> HTTP/1.1" 200 50 "-" "-"
>
> '89.151.119.195 - - [05/Nov/2009:04:02:07 +0000] "GET
> /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
> HTTP/1.1" 200 50 "-" "-"\n'
> >>> magpie.l.stamp
> ''
> >>> magpie.l.getline()
>
> ''
> >>>
>
> I expected to be able to call getline() and get more lines...
You are reading through the entire file on load because your timestamp
check never succeeds. No line matches, so the loop runs to the end of
the file and leaves you holding just the last line. Check the date you
are supplying ("[Nov/05/2009") against the actual data
("[05/Nov/2009:04:02:07 +0000]") - they don't match.
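A quick way to see the mismatch is to run the timestamp extraction on the line from your session and test both prefixes. This is only a small sketch, with the log line joined back onto one line and timestamp() written as a plain function:

```python
def timestamp(line):
    return " ".join(line.split()[3:5])

line = ('89.151.119.195 - - [05/Nov/2009:04:02:07 +0000] "GET '
        '/service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812 '
        'HTTP/1.1" 200 50 "-" "-"\n')

stamp = timestamp(line)

# The prefix as supplied puts the month first; the log puts the day first.
matches_supplied = stamp.startswith("[Nov/05/2009")
matches_day_first = stamp.startswith("[05/Nov/2009")
```

Here matches_supplied is False and matches_day_first is True.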
Kent