iterating over a file with two pointers

Oscar Benjamin oscar.j.benjamin at gmail.com
Thu Sep 19 10:16:43 EDT 2013


On 19 September 2013 08:23, Peter Otten <__peter__ at web.de> wrote:
> Roy Smith wrote:
>>
>> I believe by "Peter's version", you're talking about:
>>
>>> from itertools import islice, tee
>>>
>>> with open("tmp.txt") as f:
>>>     while True:
>>>         for outer in f:
>>>             print outer,
>>>             if "*" in outer:
>>>                 f, g = tee(f)
>>>                 for inner in islice(g, 3):
>>>                     print "   ", inner,
>                    del g # a good idea in the general case
>>>                 break
>>>         else:
>>>             break
>>
>> There's this note from
>> http://docs.python.org/2.7/library/itertools.html#itertools.tee:
>>
>>> This itertool may require significant auxiliary storage (depending on how
>>> much temporary data needs to be stored). In general, if one iterator uses
>>> most or all of the data before another iterator starts, it is faster to
>>> use list() instead of tee().

This is referring to the case where your two iterators get out of sync
by a long way. If you only consume 3 extra items it will just store
those 3 items in a list.

>> I have no idea how that interacts with the pattern above where you call
>> tee() serially.

Fair point.

>
> As I understand it the above says that
>
> items = infinite()
> a, b = tee(items)
> for item in islice(a, 1000):
>    pass
> for pair in izip(a, b):
>     pass
>
> stores 1000 items and can go on forever, but
>
> items = infinite()
> a, b = tee(items)
> for item in a:
>     pass
>
> will consume unbounded memory and that if items is finite using a list
> instead of tee is more efficient. The documentation says nothing about
>
> items = infinite()
> a, b = tee(items)
> del a
> for item in b:
>    pass
>
> so you have to trust Mr Hettinger or come up with a test case...
>
>> You're basically doing
>>
>> with open("my_file") as f:
>> while True:
>>     f, g = tee(f)
>>
>> Are all of those g's just hanging around, eating up memory, while waiting
>> to be garbage collected?  I have no idea.
>
> I'd say you've just devised a nice test to find out ;)

$ cat tee.py
#!/usr/bin/env python

import sys
from itertools import tee

items = iter(range(int(sys.argv[1])))

while True:
    for x in items:
        items, discard = tee(items)
        break
    else:
        break

print(x)

$ time py -3.3 ./tee.py 100000000
99999999

real    1m47.711s
user    0m0.015s
sys     0m0.000s

While running the above python.exe was using 6MB of memory (according
to Task Manager). I believe this is because tee() works as follows
(which I made up but it's how I imagine it).

When you call tee(iterator) it creates two _tee objects and one
_teelist object. The _teelist object stores all of the items that have
been seen by only one of _tee1 and _tee2, a reference to iterator and
a flag indicating which _tee object has seen more items. When say
_tee2 is deallocated the _teelist becomes singly owned and no longer
needs to ever accumulate items (so it doesn't). So the dereferenced
discard will not cause an arbitrary growth in memory usage.

There is a separate problem which is that if you call tee() multiple
times then you end up with a chain of tees and each next call would go
through each one of them. This would cause a linear growth in the time
taken to call next() leading to quadratic time performance overall.
However, this does not occur with the script I showed above. In
principle it's possible for a _tee object to realise that there is a
chain of singly owned _tee and _teelist objects and bypass them
calling next() on the original iterator but I don't know if this is
what happens.

However, when I ran the above script on Python 2.7 it did consume
massive amounts of memory (1.6GB) and ran slower so maybe this depends
on optimisations that were introduced in 3.x.

Here's an alternate iterator recipe that doesn't depend on these optimisations:

from itertools import islice
from collections import deque

class Peekable(object):

    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.peeked = deque()

    def __iter__(self):
        while True:
            while self.peeked:
                yield self.peeked.popleft()
            yield next(self.iterator)

    def peek(self):
        for p in self.peeked:
            yield p
        for val in self.iterator:
            self.peeked.append(val)
            yield val

with open("tmp.txt") as f:
    f = Peekable(f)
    for outer in f:
        print outer,
        if "*" in outer:
            for inner in islice(f.peek(), 3):
                print "   ", inner,


Oscar



More information about the Python-list mailing list