[Tutor] Iterable Understanding

Martin Walsh mwalsh at mwalsh.org
Sun Nov 15 10:11:48 CET 2009


Stephen Nelson-Smith wrote:
> I think I'm having a major understanding failure.

Perhaps this will help ...
http://www.learningpython.com/2009/02/23/iterators-iterables-and-generators-oh-my/

<snip>
> So in essence this:
> 
> logs = [ LogFile( "/home/stephen/qa/ded1353/quick_log.gz", "04/Nov/2009" ),
>          LogFile( "/home/stephen/qa/ded1408/quick_log.gz", "04/Nov/2009" ),
>          LogFile( "/home/stephen/qa/ded1409/quick_log.gz", "04/Nov/2009" ) ]
> 
> Gives me a list of LogFiles - each of which has a getline() method,
> which returns a tuple.
> 
> I thought I could merge iterables using Kent's recipe, or just with
> heapq.merge()

But, at this point, are your LogFile instances even iterable? AFAICT, the
answer is no, and they need to be iterable before you can pass them to
heapq.merge. Have a look at the documentation
(http://docs.python.org/library/stdtypes.html#iterator-types) and then
re-read Kent's advice, in your previous thread ('Logfile multiplexing'),
about "using the iterator protocol" (__iter__).
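
To make the difference concrete, here is a minimal sketch. Both classes
are made up for illustration (they are not from your code): one has only
a getline()-style method, the other defines __iter__ as a generator.

```python
class GetlineOnly(object):
    """Hypothetical class with a getline() method but no __iter__."""
    def getline(self):
        return (1, 'first line')

class Countdown(object):
    """Hypothetical class that supports the iterator protocol."""
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Writing __iter__ as a generator makes every instance iterable.
        n = self.start
        while n > 0:
            yield n
            n -= 1

# iter() raises TypeError for objects that do not support iteration.
try:
    iter(GetlineOnly())
    getline_only_is_iterable = True
except TypeError:
    getline_only_is_iterable = False

countdown_values = list(Countdown(3))
```

Anything that consumes iterables (for loops, list(), heapq.merge) can use
the second kind of object directly, but not the first.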

And, judging by the heapq docs
(http://docs.python.org/library/heapq.html#heapq.merge) ...

"""
Merge multiple sorted inputs into a single sorted output (for example,
merge timestamped entries from multiple log files). Returns an iterator
over the sorted values.
"""

... using heapq.merge appears to be a reasonable approach.
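
As a small sketch of what that buys you, assuming each input is already
sorted and yields (stamp, line) tuples (the integer stamps and the line
text below are invented for the example):

```python
import heapq

# Two already-sorted inputs of (stamp, line) tuples; the values are made up.
log_a = [(1, 'a: first'), (4, 'a: second'), (9, 'a: third')]
log_b = [(2, 'b: first'), (3, 'b: second'), (10, 'b: third')]

# heapq.merge returns a lazy iterator; list() is only used here to show it.
merged = list(heapq.merge(log_a, log_b))
merged_stamps = [stamp for stamp, line in merged]
```

The tuples are compared element by element, so the stamp in the first
position decides the order.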

You might also be interested to know that, while heapq.merge is (was) new
in 2.6, its implementation is very similar (read: nearly identical) to
one of the cookbook recipes referenced by Kent.

It's unclear from your previous posts (to me at least) -- are the
individual log files already sorted in chronological order? I'd imagine
they are, being log files. But let's say you were to run your
hypothetical merge script against only one file -- would the output be
identical to the input? If not, then you'll want to sort the inputs
first.
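
One way to check is sketched below. The is_sorted helper is a name I made
up for this example, and the integer stamps stand in for whatever your
timestamp function returns.

```python
import heapq

def is_sorted(stamps):
    """Return True if each stamp is >= the one before it (hypothetical helper)."""
    previous = None
    for stamp in stamps:
        if previous is not None and stamp < previous:
            return False
        previous = stamp
    return True

in_order = is_sorted([1, 2, 2, 3])
out_of_order = is_sorted([1, 3, 2])

# heapq.merge does not sort within an input: a single unsorted input
# comes back out in the order it went in.
passthrough = list(heapq.merge([3, 1, 2]))
```

If a file fails the check, sorting its entries first (sorted() will do for
files that fit in memory) restores the precondition that merge relies on.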

> 
> But how do I get from a method that can produce a tuple, to some
> mergable iterables?
> 

I'm going to re-word this question slightly to "How can I modify the
LogFile class so that its instances are usable by heapq.merge?" and make
an attempt to answer. The following borrows heavily from Kent's iterator
example, but removes your additional line filtering (if
self.stamp.startswith(date), etc) to, hopefully, make it clearer.

import time, gzip, heapq

def timestamp(line):
    # replace with your own timestamp function
    # this appears to work with the sample logs I chose
    stamp = ' '.join(line.split(' ', 3)[:-1])
    return time.strptime(stamp, '%b %d %H:%M:%S')

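class LogFile(object):
    def __init__(self, filename):
        # 'rt' gives text lines on Python 3; plain 'r' yields bytes there,
        # which timestamp() cannot split on a str separator
        self.logfile = gzip.open(filename, 'rt')

    def __iter__(self):
        # a generator method makes each instance iterable, yielding
        # (timestamp, line) tuples that heapq.merge can compare
        for logline in self.logfile:
            yield (timestamp(logline), logline)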

logs = [
    LogFile("/home/stephen/qa/ded1353/quick_log.gz"),
    LogFile("/home/stephen/qa/ded1408/quick_log.gz"),
    LogFile("/home/stephen/qa/ded1409/quick_log.gz")
]

merged = heapq.merge(*logs)
with open('/tmp/merged_log', 'w') as output:
    for stamp, line in merged:
        output.write(line)

Will it be fast enough? I have no clue.

Good luck!
Marty




