[Tutor] Iterable Understanding
Martin Walsh
mwalsh at mwalsh.org
Tue Nov 17 05:33:47 CET 2009
Stephen Nelson-Smith wrote:
> Nope - but I can look it up. The problem I have is that the source
> logs are rotated at 0400 hrs, so I need two days of logs in order to
> extract 24 hrs from 0000 to 2359 (which is the requirement). At
> present, I preprocess using sort, which works fine as long as the
> month doesn't change.
Still not sure without more detail, but IIRC from your previous posts,
your log entry timestamps are formatted with the abbreviated month name
instead of month number. Without the -M flag, the sort command will ...
well, erm ... sort the month names alphabetically. With the -M
(--month-sort) flag, they are sorted chronologically.
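To make the difference concrete, here is a rough Python sketch of the two orderings. The sample date strings and the month_of helper are made up for illustration; they are not taken from your logs.

```python
# Abbreviated English month names in calendar order.
MONTHS = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()

# Made-up sample values in an assumed day/month-name/year layout.
stamps = ["01/Dec/2009", "30/Nov/2009", "15/Oct/2009"]

def month_of(stamp):
    """Return the month abbreviation from a 'DD/Mon/YYYY' string."""
    return stamp.split("/")[1]

# Roughly what a plain sort does with the month field: compare the names as text.
alphabetical = sorted(stamps, key=month_of)

# Roughly what sort -M does: compare the months by calendar position.
chronological = sorted(stamps, key=lambda s: MONTHS.index(month_of(s)))

print(alphabetical)   # ['01/Dec/2009', '30/Nov/2009', '15/Oct/2009']
print(chronological)  # ['15/Oct/2009', '30/Nov/2009', '01/Dec/2009']
```

Only the month field is compared here, which is enough to show why an alphabetical sort breaks as soon as the month changes.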
Just a guess, of course. I suppose this is drifting a bit off topic, in
any case, but it may still serve to demonstrate the importance of
converting your string based timestamps into something that can be
sorted accurately by your python code -- the most obvious being time or
datetime objects, IMHO.
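As a minimal sketch of that conversion, assuming Apache-style entries with a bracketed timestamp such as [17/Nov/2009:05:33:47 +0100]: the log layout and the sample lines are my assumptions, and this timestamp() is only a stand-in for whatever helper you already have.

```python
from datetime import datetime

def timestamp(logline):
    """Parse the bracketed timestamp of an (assumed) Apache-style log line."""
    start = logline.index("[") + 1
    end = logline.index("]", start)
    # Drop the UTC offset; this assumes every entry uses the same offset.
    stamp = logline[start:end].split()[0]
    # %b matches the abbreviated month name in the current locale.
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")

# Made-up sample lines that straddle a month boundary.
lines = [
    '10.0.0.1 - - [01/Dec/2009:00:00:05 +0000] "GET /a HTTP/1.1" 200 512',
    '10.0.0.2 - - [30/Nov/2009:23:59:58 +0000] "GET /b HTTP/1.1" 200 128',
]

# datetime objects compare chronologically, so the Nov entry sorts first.
ordered = sorted(lines, key=timestamp)
```

Because the key is a datetime rather than a string, the ordering stays correct across month and year boundaries.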
<snip>
>> class LogFile(object):
>>     def __init__(self, filename, jitter=10):
>>         self.logfile = gzip.open(filename, 'r')
>>         self.heap = []
>>         self.jitter = jitter
>>
>>     def __iter__(self):
>>         while True:
>>             for logline in self.logfile:
>>                 heappush(self.heap, (timestamp(logline), logline))
>>                 if len(self.heap) >= self.jitter:
>>                     break
>
> Really nice way to handle the batching of the initial heap - thank you!
>
>>             try:
>>                 yield heappop(self.heap)
>>             except IndexError:
>>                 raise StopIteration
<snip>
>> ... which probably won't preserve the order of log entries that have the
>> same timestamp, but if you need it to -- should be easy to accommodate.
>
> I don't think that is necessary, but I'm curious to know how...
I'd imagine something like this might work ...
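from heapq import heappush, heappop

# timestamp() is assumed to be defined as earlier in this thread.

class LogFile(object):
    def __init__(self, filename, jitter=10):
        self.logfile = open(filename, 'r')
        self.heap = []
        self.jitter = jitter

    def __iter__(self):
        line_count = 0
        while True:
            # Top up the heap until it holds `jitter` entries
            # (or the input runs out).
            for logline in self.logfile:
                line_count += 1
                # line_count breaks ties between equal timestamps,
                # so those lines come out in input order.
                heappush(self.heap,
                         ((timestamp(logline), line_count), logline))
                if len(self.heap) >= self.jitter:
                    break
            try:
                yield heappop(self.heap)
            except IndexError:
                # Heap empty and file exhausted. A plain return ends
                # the generator; raising StopIteration here becomes a
                # RuntimeError on Python 3.7+ (PEP 479).
                return
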
The key concept is to pass additional unique data to heappush, something
related to the order of lines in the input. You could probably use
file.tell() for that as well. But beware: you can't reliably tell() a file
object opened in 'r' mode while it is being used as an iterator[1], because
the iterator reads ahead into a buffer, and in Python 3.x attempting to do
so raises an IOError (an alias of OSError since Python 3.3).
[1] http://mail.python.org/pipermail/python-list/2008-November/156865.html
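If you did want a file position as the tie-breaker, one way around the tell() problem might be to open the file in binary mode and keep a running byte count yourself. This is only a sketch: iter_sorted, first_field and the sample data are names and values I made up for illustration, and the loop is written a little more compactly than the class above.

```python
import os
import tempfile
from heapq import heappush, heappop

def iter_sorted(path, timestamp, jitter=10):
    """Yield ((timestamp, byte_offset), line) pairs, sorted within a
    sliding window of `jitter` lines."""
    heap = []
    offset = 0  # byte offset of the current line, tracked by hand
    with open(path, "rb") as logfile:
        for raw in logfile:
            # The offset is unique per line, so equal timestamps are
            # ordered by position and the line itself is never compared.
            heappush(heap, ((timestamp(raw), offset), raw))
            offset += len(raw)
            if len(heap) >= jitter:
                yield heappop(heap)
    # Input exhausted: drain whatever is left in the heap.
    while heap:
        yield heappop(heap)

def first_field(raw):
    """Hypothetical timestamp(): use the first whitespace-separated field."""
    return raw.split()[0]

# Made-up sample: two lines share the key b"002".
sample = b"002 b-first\n001 a\n002 b-second\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
try:
    result = [line for _key, line in iter_sorted(tmp.name, first_field, jitter=3)]
finally:
    os.remove(tmp.name)

print(result)  # [b'001 a\n', b'002 b-first\n', b'002 b-second\n']
```

Note that the lines come back as bytes here, so a real timestamp() would need to decode them or parse bytes directly.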
HTH,
Marty