itertools.izip brokenness
bonono at gmail.com
Tue Jan 3 06:02:14 EST 2006
But that is exactly how Python iterators behave; I don't see what is
broken.
izip/zip just read one item from each of the respective streams and
give back a tuple if they can get one from each; otherwise they stop.
And because a Python iterator can only go in one direction, any item
already consumed when izip/zip stops is lost.
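
To make that concrete, here is a small sketch using plain lists instead
of files. The try/except fallback to the built-in zip is my own
assumption, added only so the snippet also runs on Pythons where izip
no longer exists; the variable names are made up for illustration.

```python
# Assumption: on Pythons without itertools.izip, the built-in zip is
# the lazy equivalent, so fall back to it.
try:
    from itertools import izip
except ImportError:
    izip = zip

# Longer iterator first: izip fetches "fgh" from `longer`, then finds
# `shorter` exhausted and stops, so "fgh" is discarded.
longer = iter(["abc", "de", "fgh", "ijk"])
shorter = iter(["xyz", "wv"])
pairs = list(izip(longer, shorter))
leftover = list(longer)
print(pairs)      # [('abc', 'xyz'), ('de', 'wv')]
print(leftover)   # ['ijk']

# Shorter iterator first: izip hits the exhausted iterator before it
# touches the longer one, so nothing is lost.
longer2 = iter(["abc", "de", "fgh", "ijk"])
shorter2 = iter(["xyz", "wv"])
pairs_rev = list(izip(shorter2, longer2))
leftover_rev = list(longer2)
print(leftover_rev)  # ['fgh', 'ijk']
```

This is why the result depends on the argument order: the inputs are
read left to right, and whatever was read before the exhausted one is
gone.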
I think you need to use map(None,...), which would not drop anything;
the shorter input is just padded with None. There is no equally lazy
version, though, as imap(None,...) doesn't behave like map but rather
like zip: it stops at the shortest input.
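
If you want the None padding of map(None,...) but lazily, a small
generator can do it. This is only a rough sketch: padded_izip is a
name I made up, not something in itertools, and it assumes a Python
new enough to have the two-argument next() built-in.

```python
def padded_izip(*iterables):
    """Lazily yield tuples, padding exhausted inputs with None.

    Hypothetical helper for illustration; not part of itertools.
    """
    iterators = [iter(it) for it in iterables]
    sentinel = object()  # marks "this iterator is exhausted"
    while True:
        # Ask every iterator for one item; exhausted ones give the sentinel.
        row = [next(it, sentinel) for it in iterators]
        if all(item is sentinel for item in row):
            return  # every input is exhausted
        yield tuple(None if item is sentinel else item for item in row)


rows = list(padded_izip(["abc", "de", "fgh"], ["xyz", "wv"]))
print(rows)  # [('abc', 'xyz'), ('de', 'wv'), ('fgh', None)]
```

For the two-file case you would loop over padded_izip(file1, file2)
and print an empty column wherever a None shows up. Later Python
releases ship this behaviour as itertools.izip_longest (2.6 and up)
and itertools.zip_longest (3.x).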
rurpy at yahoo.com wrote:
> The code below should be pretty self-explanatory.
> I want to read two files in parallel, so that I
> can print corresponding lines from each, side by
> side. itertools.izip() seems the obvious way
> to do this.
>
> izip() will stop iterating when it reaches the
> end of the shortest file. I don't know how to
> tell which file was exhausted so I just try printing
> them both. The exhausted one will generate a
> StopIteration, the other will continue to be
> iterable.
>
> The problem is that sometimes, depending on which
> file is the shorter, a line ends up missing,
> appearing neither in the izip() output, nor in
> the subsequent direct file iteration. I would
> guess that it was in izip's buffer when izip
> terminates due to the exception on the other file.
>
> This behavior seems plain out broken, especially
> because it is dependent on order of izip's
> arguments, and not documented anywhere I saw.
> It makes using izip() for iterating files in
> parallel essentially useless (unless you are
> lucky enough to have files of the same length).
>
> Also, it seems to me that this is likely a problem
> with any iterables with different lengths.
> I am hoping I am missing something...
>
> #---------------------------------------------------------
> # Task: print contents of file1 in column 1, and
> # contents of file2 in column two. iterators and
> # izip() are the "obvious" way to do it.
>
> from itertools import izip
> import cStringIO, pdb
>
> def prt_files (file1, file2):
>
>     for line1, line2 in izip (file1, file2):
>         print line1.rstrip(), "\t", line2.rstrip()
>
>     try:
>         for line1 in file1:
>             print line1,
>     except StopIteration: pass
>
>     try:
>         for line2 in file2:
>             print "\t",line2,
>     except StopIteration: pass
>
> if __name__ == "__main__":
>     # Use StringIO to simulate files. Real files
>     # show the same behavior.
>     f = cStringIO.StringIO
>
>     print "Two files with same number of lines work ok."
>     prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\nstu\n"))
>
>     print "\nFirst file shorter is also ok."
>     prt_files (f("abc\nde\n"), f("xyz\nwv\nstu\n"))
>
>     print "\nSecond file shorter is a problem."
>     prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\n"))
>     print "What happened to \"fgh\" line that should be in column 1?"
>
>     print "\nBut only a problem for one line."
>     prt_files (f("abc\nde\nfgh\nijk\nlm\n"), f("xyz\nwv\n"))
>     print "The line \"fgh\" is still missing, but following\n" \
>           "line(s) are ok! Looks like izip() ate a line."