iterating over a file with two pointers

Peter Otten __peter__ at web.de
Thu Sep 19 03:23:36 EDT 2013


Roy Smith wrote:

>> Dave Angel <davea at davea.name> wrote (and I agreed with):
>>> I'd suggest you open the file twice, and get two file objects.  Then you
>>> can iterate over them independently.
> 
> 
> On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
>> There's no need to use OS resources by opening the file twice or to
>> screw up the IO caching with seek().
> 
> There's no reason NOT to use OS resources.  That's what the OS is there
> for; to make life easier on application programmers.  Opening a file twice
> costs almost nothing.  File descriptors are almost as cheap as whitespace.
> 
>> Peter's version holds just as many lines as is necessary in an
>> internal Python buffer and performs the minimum possible
>> amount of IO.
> 
> I believe by "Peter's version", you're talking about:
> 
>> from itertools import islice, tee
>> 
>> with open("tmp.txt") as f:
>>     while True:
>>         for outer in f:
>>             print outer,
>>             if "*" in outer:
>>                 f, g = tee(f)
>>                 for inner in islice(g, 3):
>>                     print "   ", inner,
                   del g # a good idea in the general case
>>                 break
>>         else:
>>             break
> 
> 
> There's this note from
> http://docs.python.org/2.7/library/itertools.html#itertools.tee:
> 
>> This itertool may require significant auxiliary storage (depending on how
>> much temporary data needs to be stored). In general, if one iterator uses
>> most or all of the data before another iterator starts, it is faster to
>> use list() instead of tee().
> 
> 
> I have no idea how that interacts with the pattern above where you call
> tee() serially.  
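For what it's worth, the trade-off in that docs note can be sketched like this (Python 3 syntax; the data is a stand-in iterator): when one tee branch drains the source before the other starts, every item ends up in tee's buffer, so materializing a list up front does the same job with less overhead.

```python
from itertools import tee

# With tee, everything `a` consumes is buffered until `b` catches up,
# so draining `a` first forces the whole source into tee's buffer.
a, b = tee(iter(range(5)))
assert list(a) == [0, 1, 2, 3, 4]
assert list(b) == [0, 1, 2, 3, 4]

# When that usage pattern is known in advance, a plain list is simpler
# and, per the docs, faster: one shared buffer, no per-item bookkeeping.
buffered = list(iter(range(5)))
a, b = iter(buffered), iter(buffered)
assert list(a) == list(b) == [0, 1, 2, 3, 4]
```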

As I understand it, the above says that

from itertools import islice, izip, tee

items = infinite()
a, b = tee(items)
for item in islice(a, 1000):
    pass
for pair in izip(a, b):
    pass

stores 1000 items and can go on forever, but
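In Python 3 terms (zip in place of izip, and with islice capping the otherwise endless second loop), that bounded-lag behaviour can be sketched like this:

```python
from itertools import count, islice, tee

items = count()                  # an infinite source
a, b = tee(items)
for item in islice(a, 1000):     # a runs 1000 ahead; tee buffers that lag
    pass

# zip advances a and b in lock step, so the lag -- and hence the internal
# buffer -- stays at 1000 items no matter how long this runs.
pairs = list(islice(zip(a, b), 5))
assert pairs[0] == (1000, 0)
assert pairs[-1] == (1004, 4)
```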

items = infinite()
a, b = tee(items)
for item in a:
    pass

will consume unbounded memory, and that if items is finite, using a list 
instead of tee is more efficient. The documentation says nothing about

items = infinite()
a, b = tee(items)
del a
for item in b:
    pass

so you have to trust Mr Hettinger or come up with a test case...
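One such test case (CPython-specific, and the tracemalloc threshold is a rough assumption): if tee kept buffering on behalf of the deleted iterator, the peak for a million items would be many megabytes; if the dropped branch's links are freed as b advances, it stays tiny.

```python
import tracemalloc
from itertools import tee

tracemalloc.start()

items = iter(range(10**6))
a, b = tee(items)
del a                        # drop the lagging branch before iterating
for _ in b:                  # consume a million items through b alone
    pass

_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Buffering all 10**6 items would cost tens of MB; a bounded buffer
# stays in the kilobyte range.
assert peak < 1_000_000, peak
```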

> You're basically doing
> 
> with open("my_file") as f:
>     while True:
>         f, g = tee(f)
> 
> Are all of those g's just hanging around, eating up memory, while waiting
> to be garbage collected?  I have no idea.  

I'd say you've just devised a nice test to find out ;)
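In that spirit, here is a rough CPython test for the serial-tee pattern. Rebinding g each pass drops the previous copy, so if the discarded g's really did hang around pinning buffers, the peak would grow with the loop count (the 1 MB threshold is a crude assumption):

```python
import tracemalloc
from itertools import tee

tracemalloc.start()

f = iter(range(10**5))
for _ in range(10**4):
    f, g = tee(f)        # rebinding g discards the previous copy
    next(f)              # advance the kept branch; g now lags by one

_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# A flat, small peak after 10**4 serial tee() calls suggests the
# discarded copies are collected as the loop goes.
assert peak < 1_000_000, peak
```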

> But I do know that no such
> problems exist with the two file descriptor versions.

The trade-offs are different. My version works with arbitrary iterators 
(think stdin), but will consume unbounded amounts of memory when the inner 
loop doesn't stop.
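For reference, a hedged Python 3 adaptation of that version, driven by a plain in-memory iterator standing in for stdin (the function name and sample data are made up, and it returns the lines instead of printing so the behaviour is easy to check):

```python
from itertools import islice, tee

def scan(lines):
    """Walk lines; after any line containing '*', show the next three
    lines indented as a look-ahead, then resume the normal walk (the
    look-ahead lines appear again, unindented, as tee replays them)."""
    out = []
    f = iter(lines)
    while True:
        for outer in f:
            out.append(outer)
            if "*" in outer:
                f, g = tee(f)
                for inner in islice(g, 3):
                    out.append("    " + inner)
                del g            # a good idea in the general case
                break
        else:
            break
    return out

assert scan(["a", "b*", "c", "d", "e", "f"]) == [
    "a", "b*", "    c", "    d", "    e", "c", "d", "e", "f"]
```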




More information about the Python-list mailing list