[Tutor] Iterable Understanding

Martin Walsh mwalsh at mwalsh.org
Sun Nov 15 18:44:24 CET 2009


Stephen Nelson-Smith wrote:
>> It's unclear from your previous posts (to me at least) -- are the
>> individual log files already sorted, in chronological order?
> 
> Sorry if I didn't make this clear.  No they're not.  They are *nearly*
> sorted - ie they're out by a few seconds, every so often, but they are
> in order at the level of minutes, or even in the order of a few
> seconds.
> 
> It was precisely because of this that I decided, following Alan's
> advice, to pre-filter the data.  I compiled a unix sort command to do
> this, and had a solution I was happy with, based on Kent's iterator
> example, fed into heapq.merge.
> 
> However, I've since discovered that the unix sort isn't reliable on
> the last and first day of the month.  So, I decided I'd need to sort
> each logfile first.  The code at the start of *this* thread does this
> - it uses a heapq per logfile and is able to produce a tuple of
> timestamp, logline, which will be in exact chronological order.  What
> I want to do is merge this output into a file.

Well, you haven't described the unreliable behavior of unix sort so I
can only guess, but I assume you know about the --month-sort (-M) flag?

I did misunderstand your intent for this thread, so thanks for
clarifying. The fact remains that if you are interested in using
heapq.merge, then you need to pass it iterable objects, each of which
yields its items already in sorted order. And I don't see any reason
to avoid adapting your approach to fit heapq.merge. How about
something like the following (completely untested) ...

import gzip
import time
from heapq import heappush, heappop, merge

def timestamp(line):
    # replace with your own timestamp function
    # this appears to work with the sample logs I chose
    stamp = ' '.join(line.split(' ', 3)[:-1])
    return time.strptime(stamp, '%b %d %H:%M:%S')

class LogFile(object):
    def __init__(self, filename, jitter=10):
        self.logfile = gzip.open(filename, 'rt')  # text mode, so lines are str
        self.heap = []
        self.jitter = jitter

    def __iter__(self):
        while True:
            for logline in self.logfile:
                heappush(self.heap, (timestamp(logline), logline))
                if len(self.heap) >= self.jitter:
                    break
            try:
                yield heappop(self.heap)
            except IndexError:
                # heap is empty and the file is exhausted: end the generator
                return

logs = [
    LogFile("/home/stephen/qa/ded1353/quick_log.gz"),
    LogFile("/home/stephen/qa/ded1408/quick_log.gz"),
    LogFile("/home/stephen/qa/ded1409/quick_log.gz")
]

merged_log = merge(*logs)
with open('/tmp/merged_log', 'w') as output:
    for stamp, line in merged_log:
        output.write(line)


... which probably won't preserve the order of log entries that have the
same timestamp (ties fall through to comparing the line text), but if
you need it to, that should be easy to accommodate.
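
One way to keep equal-timestamp entries in their original order, as a
rough sketch only: push a running counter as the second tuple element,
so ties on the timestamp are broken by input position rather than by
the line text. The StableLogFile name, the first_field helper and the
sample lines are all made up for illustration, and the class takes any
iterable of lines plus a timestamp function (instead of a gzip file) so
that it runs standalone.

```python
from heapq import heappush, heappop, merge
from itertools import count


class StableLogFile(object):
    """Hypothetical LogFile variant: equal timestamps keep input order."""

    def __init__(self, lines, timestamp, jitter=10):
        self.lines = iter(lines)
        self.timestamp = timestamp   # function: line -> sortable stamp
        self.jitter = jitter         # number of lines buffered for reordering

    def __iter__(self):
        heap = []
        counter = count()
        for line in self.lines:
            # the counter sits between stamp and line, so the line text
            # is never used to break a tie within this file
            heappush(heap, (self.timestamp(line), next(counter), line))
            if len(heap) >= self.jitter:
                yield heappop(heap)
        # input exhausted: drain whatever is still buffered
        while heap:
            yield heappop(heap)


def first_field(line):
    # made-up timestamp function: an integer at the start of the line
    return int(line.split(' ', 1)[0])


a = StableLogFile(["1 zebra", "1 apple", "3 c", "2 b"], first_field, jitter=3)
b = StableLogFile(["2 y", "2 x"], first_field, jitter=3)

# each item is now a 3-tuple, so the unpacking gains a field
merged = [line for stamp, seq, line in merge(a, b)]
print(merged)
```

Note that this only pins down the order within a single file. Entries
from different files that share a timestamp are ordered by their
counter values, which may or may not be what you want.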

HTH,
Marty

