[Tutor] Question about nested FOR looping

Michael Janssen Janssen@rz.uni-frankfurt.de
Wed Apr 30 15:25:02 2003


On Wed, 30 Apr 2003 stuart_clemons@us.ibm.com wrote:

> So, as a follow-up out of curiosity, which is the most efficient:
>
> 1) Read the two files into lists and then process the lists, or

with (vary) large files this is forbidden because of memory-consumption.
With reasonable small files it can be handy espessially if you need random
access (not looping line-by-line).

> 2) Process directly from the file reads, using seek in the nested FOR, or

*can* be good for memory. NB: "for line in fo.readlines():" build a
(possibly) huge list of lines within memory. fo.xreadlines() is better or
recent-version-of-python "for line in fo:"

> 3) Read one file into a list and process during the read of the second
> file ?

depends on wich is the big file.

The problem with all these approaches is, that it iter over len(list1) x
len(list2). With large lists you will have many comparisions (comparing
e.g. 10,000 x 10,000 entries *this way* can easily last for some minutes).

Since dictinary lookup is fast it is much better to store the lines of
first file in a dictionary and look for each line of the second file if
the dictionary "has_key(line)" and report the duplicates.

> As some added info, Scott Widney from tutor@python.org, provided a
> dictionary construct for this same problem, ie, compare two lists,
> identifying the duplicates.   I would guess that the dictionary construct
> would be the most efficient method, based on random vs sequential
> ordering.  Is this right ?

Yes, because dictionary lookup is fast because of random ordering.

say len(list1) --> 10
    len(list2) --> 20

for line in list1:  # iters 10 times
    for line2 in list2: # iters 10 x 20 times
        if line == line2: # 200 comparisions
             pass

dict1 = dict([(x, 1) for x in list1]) # builds a dictionary from list1

for line in list2:  # iters 20 times
    if dict1.has_key(line): # 20 lookups
         pass

NB:
for line in list1:  # iters 10 times
    if line in list1: # 10 lookups
         pass

is possibly slower since looups in lists are slow. Could well be that this
is as slow as the dubble-for-loop solution in case the "if line in list1"
lookup works internally as a iteration through each element.

Michael