[Tutor] Iterable Understanding

Martin Walsh mwalsh at mwalsh.org
Tue Nov 17 05:33:47 CET 2009


Stephen Nelson-Smith wrote:
> Nope - but I can look it up.  The problem I have is that the source
> logs are rotated at 0400 hrs, so I need two days of logs in order to
> extract 24 hrs from 0000 to 2359 (which is the requirement).  At
> present, I preprocess using sort, which works fine as long as the
> month doesn't change.

Still not sure without more detail, but IIRC from your previous posts,
your log entry timestamps are formatted with the abbreviated month name
instead of month number. Without the -M flag, the sort command will ...
well, erm ... sort the month names alphabetically. With the -M
(--month-sort) flag, they are sorted chronologically.

Just a guess, of course. I suppose this is drifting a bit off topic, in
any case, but it may still serve to demonstrate the importance of
converting your string-based timestamps into something that can be
sorted accurately by your Python code -- the most obvious being time or
datetime objects, IMHO.
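
To make that concrete, here is a rough sketch of the difference between
sorting the raw strings and sorting parsed datetime objects. I don't know
your actual log format, so STAMP_FORMAT and parse_stamp below are
assumptions for illustration only (an Apache-style stamp with an
abbreviated month name); adjust them to match your data.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "17/Nov/2009:05:33:47".
# Note: %b matches month abbreviations in the current locale
# (English names under the default C locale).
STAMP_FORMAT = "%d/%b/%Y:%H:%M:%S"


def parse_stamp(text):
    """Turn a timestamp string into a datetime that sorts chronologically."""
    return datetime.strptime(text, STAMP_FORMAT)


stamps = ["01/Dec/2009:00:00:05", "30/Nov/2009:23:59:58"]

# Plain string comparison: "01/Dec" sorts before "30/Nov",
# which is wrong across the month boundary.
as_strings = sorted(stamps)

# Comparing parsed datetime objects gives chronological order.
as_datetimes = sorted(stamps, key=parse_stamp)
```

With the two sample stamps above, as_strings puts the December entry
first, while as_datetimes puts the November entry first.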

<snip>
>> class LogFile(object):
>>    def __init__(self, filename, jitter=10):
>>        self.logfile = gzip.open(filename, 'r')
>>        self.heap = []
>>        self.jitter = jitter
>>
>>    def __iter__(self):
>>        while True:
>>            for logline in self.logfile:
>>                heappush(self.heap, (timestamp(logline), logline))
>>                if len(self.heap) >= self.jitter:
>>                    break
> 
> Really nice way to handle the batching of the initial heap - thank you!
> 
>>            try:
>>                yield heappop(self.heap)
>>            except IndexError:
>>                raise StopIteration
<snip>
>> ... which probably won't preserve the order of log entries that have the
>> same timestamp, but if you need it to -- should be easy to accommodate.
> 
> I don't think that is necessary, but I'm curious to know how...

I'd imagine something like this might work ...

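from heapq import heappush, heappop

# timestamp() is assumed to be defined elsewhere (not shown in this message).

class LogFile(object):
    def __init__(self, filename, jitter=10):
        self.logfile = open(filename, 'r')
        self.heap = []
        self.jitter = jitter

    def __iter__(self):
        line_count = 0
        while True:
            # Top the heap up to `jitter` entries before popping.
            for logline in self.logfile:
                line_count += 1
                # line_count breaks ties between equal timestamps,
                # so such lines come out in input order.
                heappush(self.heap,
                   ((timestamp(logline), line_count), logline))
                if len(self.heap) >= self.jitter:
                    break
            if not self.heap:
                # Input and heap are both exhausted. A plain return ends
                # the generator; raising StopIteration here turns into a
                # RuntimeError on Python 3.7+ (PEP 479).
                return
            yield heappop(self.heap)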

The key concept is to pass additional unique data to heappush, something
related to the order of lines from input. Tuples compare element by
element, so when two timestamps are equal the line count decides which
entry is popped first. You could probably do something with file.tell()
also. But beware, it seems you can't reliably tell() a file object
opened in 'r' mode and used as an iterator[1], and in Python 3.x
attempting to do so raises an IOError (an alias of OSError since 3.3).

[1] http://mail.python.org/pipermail/python-list/2008-November/156865.html
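
If you did want to try the tell() route, one way around that caveat might
be to open the file in binary mode and call readline() explicitly instead
of iterating, so that tell() stays usable. This is only a sketch under
that assumption: offset_tagged_lines and jitter_sorted are names I made
up for illustration, the timestamp function is passed in by the caller,
and a plain (not gzipped) file is assumed.

```python
import heapq


def offset_tagged_lines(path):
    """Yield (byte_offset, line) pairs, where byte_offset is the position
    at which the line starts. Binary mode plus explicit readline() calls
    keeps tell() usable, unlike iterating over a text-mode file."""
    with open(path, 'rb') as f:
        while True:
            offset = f.tell()
            raw = f.readline()
            if not raw:
                return
            yield offset, raw.decode('utf-8', errors='replace')


def jitter_sorted(path, timestamp, jitter=10):
    """Yield ((timestamp, offset), line) in sorted order, buffering up to
    `jitter` lines. The byte offset breaks ties between equal timestamps,
    so such lines keep their input order."""
    heap = []
    for offset, line in offset_tagged_lines(path):
        heapq.heappush(heap, ((timestamp(line), offset), line))
        if len(heap) >= jitter:
            yield heapq.heappop(heap)
    # Input exhausted: drain whatever is left in the heap.
    while heap:
        yield heapq.heappop(heap)
```

Usage would look something like
"for (stamp, offset), line in jitter_sorted('access.log', timestamp): ..."
where the file name is just a placeholder. The byte offset plays the same
role as line_count above; it is unique per line and increases with input
order.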

HTH,
Marty

