[Tutor] Question about nested FOR looping

Magnus Lyckå magnus@thinkware.se
Wed Apr 30 16:43:02 2003


At 14:11 2003-04-30 -0400, stuart_clemons@us.ibm.com wrote:
>So, as a follow-up out of curiosity, which is the most efficient:
>
>1) Read the two files into lists and then process the lists, or
>
>2) Process directly from the file reads, using seek in the nested FOR, or
>
>3) Read one file into a list and process during the read of the second file ?

I can only suggest that you measure with reasonable data sets.
It's probably also worth measuring with several different file
sizes, to see how the timing changes as the files grow.

import profile
profile.run('myFunction()')
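
Or, to see how runtime grows with size, something along these
lines might do (untested sketch; the compare function, the sizes
and the test data here are just stand-ins for whatever you are
actually timing):

import time, random

def compare(list1, list2):
    # Stand-in for whichever comparison routine you want to time.
    return [x for x in list1 if x in list2]

for size in (1000, 5000, 10000):
    # Build two random lists of the given size as test data.
    a = [str(random.randrange(size * 10)) for i in range(size)]
    b = [str(random.randrange(size * 10)) for i in range(size)]
    start = time.clock()
    compare(a, b)
    print size, time.clock() - start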

I'm not sure any of these approaches is optimal though. :)

If the sizes of your files are M and N respectively, you
will have a runtime which is roughly proportional to M*N.
That's pretty lousy from a performance point of view.

It's the first logical approach, and unless performance
is really an issue, I wouldn't bother going any further,
but then you're the one who asked about performance.
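
For reference, the nested loop version boils down to something
like this (untested; the file names are only placeholders), which
compares every line of one file against every line of the other:

m = open('file1.txt').readlines()    # placeholder file names
s = open('file2.txt').readlines()

for m_line in m:
    for s_line in s:                 # len(m) * len(s) comparisons in total
        if m_line.strip() == s_line.strip():
            print "this num is duplicate: ", m_line.strip()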

>As some added info, Scott Widney from tutor@python.org, provided a 
>dictionary construct for this same problem, ie, compare two lists, 
>identifying the duplicates.   I would guess that the dictionary construct 
>would be the most efficient method, based on random vs sequential 
>ordering.  Is this right ?

This should be faster, more like proportional to M+N in
runtime. I.e., if both M and N get five times bigger, the
runtime will increase five times, not twenty-five times as
with the nested loop.
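
Roughly (untested, assuming one value per line; the file names
are just placeholders), the dictionary version could look like:

d = {}
for line in open('file1.txt').readlines():   # placeholder file names
    d[line.strip()] = 1                       # remember every value seen
for line in open('file2.txt').readlines():
    if d.has_key(line.strip()):               # constant-time lookup
        print "this num is duplicate: ", line.strip()

Each file is read once, and each dictionary lookup takes roughly
constant time, which is where the M+N figure comes from.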

In practice, performance figures don't usually follow the
simplistic theories we apply. For instance, most functions
will drop significantly in performance when RAM is exhausted,
so if solution A is faster than solution B both when N=10000
(all in RAM) and when N=1000000 (both A and B swapping badly)
there might be an intermediate value where B is much faster
than A if it uses less memory (A is swapping but B isn't).
Whether and when this will happen will depend on how loaded
the machine is with other jobs etc, so it's not trivial to
know performance beforehand...

There are other solutions too... If you have two sorted lists
without duplicates, you could imagine stepping through them in
parallel, like this untested code:

m = <a sorted list with no duplicates>
s = <another sorted list with no duplicates>

l_m, l_s = len(m), len(s)
m_pos = s_pos = 0

# Step through both lists in parallel, like the merge phase of mergesort.
while m_pos < l_m and s_pos < l_s:
    m_val, s_val = m[m_pos].strip(), s[s_pos].strip()
    if m_val == s_val:
        # The value appears in both lists: report it and advance both.
        print "this num is duplicate: ", m_val
        m_pos += 1
        s_pos += 1
    elif m_val < s_val:
        # m is behind s: advance m.
        m_pos += 1
    else:
        # s is behind m: advance s.
        s_pos += 1

I'm fairly sure 'and' is the right exit condition: once either
list is exhausted there can be no more duplicates to find, and
with 'or' the loop would index past the end of the shorter list.

Anyway, this algorithm should also be O(N+M), and I don't
think you can beat that for any solution where you need to
read the files.
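
If the files aren't already sorted, one way (again untested; the
file names are just placeholders) to get m and s into that sorted,
duplicate-free shape could be:

def sorted_unique(filename):
    # Read the file, drop duplicates via a dictionary, return sorted values.
    d = {}
    for line in open(filename).readlines():
        d[line.strip()] = 1
    values = d.keys()
    values.sort()
    return values

m = sorted_unique('file1.txt')
s = sorted_unique('file2.txt')

Of course, the sort itself is O(N log N), so this merge-style
scan only pays off over the dictionary approach if the data is
already sorted.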


--
Magnus Lycka (It's really Lyckå), magnus@thinkware.se
Thinkware AB, Sweden, www.thinkware.se
I code Python ~ The shortest path from thought to working program