itertools.izip brokenness

rurpy at yahoo.com rurpy at yahoo.com
Tue Jan 3 05:19:13 EST 2006


The code below should be pretty self-explanatory.
I want to read two files in parallel, so that I
can print corresponding lines from each, side by
side.  itertools.izip() seems the obvious way
to do this.

izip() will stop iterating when it reaches the
end of the shortest file.  I don't know how to
tell which file was exhausted, so I just try printing
them both.  The exhausted one will generate a
StopIteration; the other will continue to be
iterable.

The problem is that sometimes, depending on which
file is the shorter, a line ends up missing,
appearing neither in the izip() output nor in
the subsequent direct file iteration.  I would
guess that it was in izip's buffer when izip
terminated due to the exception on the other file.
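
A minimal sketch of the suspected mechanism, using plain lists of
lines instead of files.  izip() asks each iterable for its next item
from left to right, so an item already taken from the first iterable
is dropped when the second one turns out to be empty.  The fallback
to the built-in zip() is an assumption for readers on Python 3, where
izip() no longer exists; the variable names are only illustrative.

```python
try:
    from itertools import izip      # Python 2, as used in this post
except ImportError:
    izip = zip                      # assumption: Python 3, where zip() is lazy

# Longer iterable passed first: its third item is fetched, then discarded.
long_a = iter(["abc\n", "de\n", "fgh\n"])
short_a = iter(["xyz\n", "wv\n"])
pairs_a = list(izip(long_a, short_a))
left_a = list(long_a)               # what remains in the longer iterable

# Shorter iterable passed first: izip() stops before touching the third item.
short_b = iter(["xyz\n", "wv\n"])
long_b = iter(["abc\n", "de\n", "fgh\n"])
pairs_b = list(izip(short_b, long_b))
left_b = list(long_b)

print(left_a)   # empty: "fgh" was consumed by izip() and lost
print(left_b)   # still holds "fgh\n"
```

This matches the order dependence described above: the line only
disappears when the longer input comes first.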

This behavior seems plain broken, especially
because it depends on the order of izip's
arguments and is not documented anywhere I saw.
It makes using izip() for iterating files in
parallel essentially useless (unless you are
lucky enough to have files of the same length).

Also, it seems to me that this is likely a problem
with any iterables with different lengths.
I am hoping I am missing something...

#---------------------------------------------------------
# Task: print contents of file1 in column 1, and
# contents of file2 in column two.  iterators and
# izip() are the "obvious" way to do it.

from itertools import izip
import cStringIO

def prt_files (file1, file2):

        for line1, line2 in izip (file1, file2):
            print line1.rstrip(), "\t", line2.rstrip()

        try:
            for line1 in file1:
                print line1,
        except StopIteration: pass

        try:
            for line2 in file2:
                print "\t",line2,
        except StopIteration: pass

if __name__ == "__main__":
        # Use StringIO to simulate files.  Real files
        # show the same behavior.
        f = cStringIO.StringIO

        print "Two files with same number of lines work ok."
        prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\nstu\n"))

        print "\nFirst file shorter is also ok."
        prt_files (f("abc\nde\n"), f("xyz\nwv\nstu\n"))

        print "\nSecond file shorter is a problem."
        prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\n"))
        print "What happened to \"fgh\" line that should be in column 1?"

        print "\nBut only a problem for one line."
        prt_files (f("abc\nde\nfgh\nijk\nlm\n"), f("xyz\nwv\n"))
        print "The line \"fgh\" is still missing, but following\n" \
            "line(s) are ok!  Looks like izip() ate a line."
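
One possible way around the problem, sketched under the assumption
that a later Python is available: itertools.izip_longest (Python 2.6
and up) or itertools.zip_longest (Python 3) pads the shorter input
instead of stopping, so no line is consumed and thrown away.  The
function name side_by_side is hypothetical, and it returns the rows
rather than printing them so the result is easy to check.

```python
try:
    from itertools import izip_longest as zip_longest   # Python 2.6+
except ImportError:
    from itertools import zip_longest                    # Python 3

def side_by_side(file1, file2):
    """Return one tab-separated row per line pair, padding the shorter input."""
    rows = []
    # fillvalue="" stands in for the missing line once one input runs out.
    for line1, line2 in zip_longest(file1, file2, fillvalue=""):
        rows.append(line1.rstrip() + "\t" + line2.rstrip())
    return rows

# Lists of lines stand in for the files here; any iterable of lines works.
for row in side_by_side(["abc\n", "de\n", "fgh\n"], ["xyz\n", "wv\n"]):
    print(row)
```

With this approach the order of the arguments only decides which
column a line lands in, not whether it shows up at all.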



