iterating over a file with two pointers
Peter Otten
__peter__ at web.de
Thu Sep 19 03:23:36 EDT 2013
Roy Smith wrote:
>> Dave Angel <davea at davea.name> wrote (and I agreed with):
>>> I'd suggest you open the file twice, and get two file objects. Then you
>>> can iterate over them independently.
>
>
> On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
>> There's no need to use OS resources by opening the file twice or to
>> screw up the IO caching with seek().
>
> There's no reason NOT to use OS resources. That's what the OS is there
> for: to make life easier on application programmers. Opening a file twice
> costs almost nothing. File descriptors are almost as cheap as whitespace.
>
>> Peter's version holds just as many lines as is necessary in an
>> internal Python buffer and performs the minimum possible
>> amount of IO.
>
> I believe by "Peter's version", you're talking about:
>
>> from itertools import islice, tee
>>
>> with open("tmp.txt") as f:
>>     while True:
>>         for outer in f:
>>             print outer,
>>             if "*" in outer:
>>                 f, g = tee(f)
>>                 for inner in islice(g, 3):
>>                     print " ", inner,
>>                 del g  # a good idea in the general case
>>                 break
>>         else:
>>             break
>
>
> There's this note from
> http://docs.python.org/2.7/library/itertools.html#itertools.tee:
>
>> This itertool may require significant auxiliary storage (depending on how
>> much temporary data needs to be stored). In general, if one iterator uses
>> most or all of the data before another iterator starts, it is faster to
>> use list() instead of tee().
>
>
> I have no idea how that interacts with the pattern above where you call
> tee() serially.
As I understand it, the above says that

items = infinite()
a, b = tee(items)
for item in islice(a, 1000):
    pass
for pair in izip(a, b):
    pass
stores 1000 items and can go on forever, but

items = infinite()
a, b = tee(items)
for item in a:
    pass

will consume unbounded memory, and that if items is finite, using a list
instead of tee() is more efficient. The documentation says nothing about
items = infinite()
a, b = tee(items)
del a
for item in b:
    pass
so you have to trust Mr Hettinger or come up with a test case...
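Here is one possible test case (a CPython-specific sketch: it relies on
reference counting to free tee's internal buffer blocks as the surviving
iterator moves past them; the Item class is just an invented weakref-able
payload):

```python
import weakref
from itertools import tee

class Item:
    """Payload object we can take weak references to."""

def source(n):
    for _ in range(n):
        yield Item()

# Case 1: delete the lagging iterator before consuming the other one.
a, b = tee(source(200))
del a                        # only b is left

refs = []
for item in b:
    refs.append(weakref.ref(item))
    del item                 # drop our own strong reference

# If tee released its buffer blocks as b moved past them, most of the
# payload objects should be gone by now (only b's current block survives).
dead = sum(r() is None for r in refs)

# Case 2, for contrast: keep the lagging iterator alive.
a2, b2 = tee(source(200))
refs2 = []
for item in b2:
    refs2.append(weakref.ref(item))
    del item

# a2 still points at the start of the buffer, so everything stays alive.
alive2 = sum(r() is not None for r in refs2)
```

On CPython the first case frees almost all of the items while the second
keeps all 200 buffered, which matches the documented warning about
auxiliary storage.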
> You're basically doing
>
> with open("my_file") as f:
>     while True:
>         f, g = tee(f)
>
> Are all of those g's just hanging around, eating up memory, while waiting
> to be garbage collected? I have no idea.
I'd say you've just devised a nice test to find out ;)
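A rough version of that test (Python 3 here, using tracemalloc just to
watch the total; the 10,000 iterations and the 1 MB threshold are
arbitrary choices for the demo):

```python
import tracemalloc
from itertools import tee

def numbers():
    """An infinite source, standing in for the file object."""
    n = 0
    while True:
        yield n
        n += 1

f = numbers()
tracemalloc.start()
for _ in range(10_000):
    f, g = tee(f)   # rebinding g drops the previous sibling immediately
    next(f)         # advance the "outer" iterator one step
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# If the old g's piled up while waiting for garbage collection, current
# would be several megabytes here; with CPython's reference counting each
# g is freed on rebinding, so the loop runs in roughly constant memory.
print(current)
```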
> But I do know that no such
> problems exist with the two file descriptor versions.
The trade-offs are different. My version works with arbitrary iterators
(think stdin), but will consume unbounded amounts of memory when the inner
loop doesn't stop.
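For comparison, the two-file-object variant might look like this in
Python 3 (a self-contained sketch; the file name and sample data are
invented, and readline() is used because plain iteration disables tell()
on text files in Python 3):

```python
import os
import tempfile

# Create a small sample file so the sketch is self-contained.
path = os.path.join(tempfile.gettempdir(), "tee_demo.txt")
with open(path, "w") as f:
    f.write("a\nb *\nc\nd\ne\nf\n")

shown = []
with open(path) as outer_f, open(path) as inner_f:
    while True:
        line = outer_f.readline()
        if not line:
            break
        shown.append(line)
        if "*" in line:
            # Jump the second handle to just past the marker line and
            # read the next three lines without disturbing outer_f.
            inner_f.seek(outer_f.tell())
            for _ in range(3):
                shown.append("  " + inner_f.readline())

print("".join(shown), end="")
os.remove(path)
```

This uses constant memory regardless of how far ahead the inner reads go,
but as noted above it cannot work on a non-seekable source such as stdin.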