itertools.izip brokenness

Tue Jan 3 06:02:14 EST 2006

But that is exactly how Python iterators behave; I don't see what
is broken.

izip/zip just read one item from each of the respective streams, in
argument order, and give back a tuple if they can get one from each;
otherwise they stop. And because a Python iterator can only go in one
direction, an item already consumed from an earlier argument when a
later one runs out is lost in the zip/izip call.
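
To make the order dependence concrete, here is a minimal sketch. It
assumes Python 3, where the built-in zip is lazy and pulls from its
arguments left to right, as izip does in Python 2; the helper name
leftover_after_zip is made up for illustration.

```python
def leftover_after_zip(first, second):
    """Zip two lists through explicit iterators and report what each has left."""
    it1, it2 = iter(first), iter(second)
    pairs = list(zip(it1, it2))
    # Whatever zip did not consume is still available from it1 and it2.
    return pairs, list(it1), list(it2)

# Shorter input first: zip stops when it1 runs out, before touching it2 again.
print(leftover_after_zip(["abc", "de"], ["xyz", "wv", "stu"]))

# Shorter input second: zip has already taken "fgh" from it1 when it2 runs out,
# so "fgh" appears neither in the pairs nor in the leftovers.
print(leftover_after_zip(["abc", "de", "fgh", "ijk"], ["xyz", "wv"]))
```

In the second call only "ijk" is left in the first iterator, which
matches the single missing line in the quoted output.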

I think you need to use map(None, ...), which would not drop anything;
it just pads the shorter sequence with None. You don't get a lazy
version that way, though, as imap(None, ...) doesn't behave like map
but rather like zip: it stops at the shortest input.
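
A lazy, padding variant can be sketched as follows. This assumes
Python 3 and itertools.zip_longest (spelled izip_longest in Python 2.6
and later), which is not one of the functions discussed above;
side_by_side is a hypothetical helper name, and padding with an empty
string instead of None is my choice so that rstrip() still works.

```python
from io import StringIO
from itertools import zip_longest

def side_by_side(file1, file2):
    """Yield one tab-separated row per line pair, padding the shorter file with ''."""
    # zip_longest keeps going until both inputs are exhausted, so no line is dropped.
    for line1, line2 in zip_longest(file1, file2, fillvalue=""):
        yield line1.rstrip() + "\t" + line2.rstrip()

# StringIO stands in for real files, as in the quoted test code.
for row in side_by_side(StringIO("abc\nde\nfgh\n"), StringIO("xyz\nwv\n")):
    print(row)
```

Because the rows are produced by a generator, neither file is read
ahead of what has been printed.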

rurpy at yahoo.com wrote:
> The code below should be pretty self-explanatory.
> I want to read two files in parallel, so that I
> can print corresponding lines from each, side by
> side.  itertools.izip() seems the obvious way
> to do this.
>
> izip() will stop iterating when it reaches the
> end of the shortest file.  I don't know how to
> tell which file was exhausted so I just try printing
> them both.  The exhausted one will generate a
> StopIteration, the other will continue to be
> iterable.
>
> The problem is that sometimes, depending on which
> file is the shorter, a line ends up missing,
> appearing neither in the izip() output, nor in
> the subsequent direct file iteration.  I would
> guess that it was in izip's buffer when izip
> terminates due to the exception on the other file.
>
> This behavior seems plain out broken, especially
> because it is dependent on order of izip's
> arguments, and not documented anywhere I saw.
> It makes using izip() for iterating files in
> parallel essentially useless (unless you are
> lucky enough to have files of the same length).
>
> Also, it seems to me that this is likely a problem
> with any iterables with different lengths.
> I am hoping I am missing something...
>
> #---------------------------------------------------------
> # Task: print contents of file1 in column 1, and
> # contents of file2 in column two.  iterators and
> # izip() are the "obvious" way to do it.
>
> from itertools import izip
> import cStringIO, pdb
>
> def prt_files (file1, file2):
>
>         for line1, line2 in izip (file1, file2):
>             print line1.rstrip(), "\t", line2.rstrip()
>
>         try:
>             for line1 in file1:
>                 print line1,
>         except StopIteration: pass
>
>         try:
>             for line2 in file2:
>                 print "\t",line2,
>         except StopIteration: pass
>
> if __name__ == "__main__":
>         # Use StringIO to simulate files.  Real files
>         # show the same behavior.
>         f = cStringIO.StringIO
>
>         print "Two files with same number of lines work ok."
>         prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\nstu\n"))
>
>         print "\nFirst file shorter is also ok."
>         prt_files (f("abc\nde\n"), f("xyz\nwv\nstu\n"))
>
>         print "\nSecond file shorter is a problem."
>         prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\n"))
>         print "What happened to \"fgh\" line that should be in column 1?"
>
>         print "\nBut only a problem for one line."
>         prt_files (f("abc\nde\nfgh\nijk\nlm\n"), f("xyz\nwv\n"))
>         print "The line \"fgh\" is still missing, but following\n" \
>             "line(s) are ok!  Looks like izip() ate a line."