Misuse of list comprehensions?

Thomas Bellman bellman at lysator.liu.se
Tue May 20 10:21:05 EDT 2008


"John Salerno" <johnjsal at NOSPAMgmail.com> writes:

> "Diez B. Roggisch" <deets at nospam.web.de> wrote in message
> news:69g2bpF31ttuuU1 at mid.uni-berlin.de...
>> the above code is pretty much of a no-no because it has quadratic runtime
>> behavior.

> What's that mean, exactly?

It means that the running time will increase with the square of
the problem size.  2,000,000 items will take 4 times as long as
1,000,000 items, and 5,000,000 items will take 25 times as long
as 1,000,000 items.

> Are you referring to both examples, or just the
> second one?

The problem is that the lookup you do to see whether an item has
already been seen is a linear search of a list.  Each such search
takes time proportional to the current length of the list 'new',
and since the number of searches grows with the length of the
input, the total runtime of your function grows roughly as the
square of the input length.

This applies to both of your examples, since they really do the
same thing.
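
In outline, the pattern under discussion looks something like this
(just a sketch, since your original code isn't quoted above; the
names are guesses, apart from 'new', which comes from your posting):

    def unique_keep_order(items):
        # Quadratic version: 'item not in new' is a linear scan of
        # the list, and it is done once per input item.
        new = []
        for item in items:
            if item not in new:     # linear search of 'new'
                new.append(item)
        return new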

However, it actually depends on what your input is.  For the
runtime of your function to increase with the square of the input
length, the number of *unique* items in the input must be a
constant fraction of the input length.  By that I mean, for
example, that 5% of the input items are unique, so that if your
input is 100 items, the list 'new' will be 5 items long at the end
of the function, and if your input is 5,000,000 items, 'new' will
end up 250,000 items long.

This might not be the case in your use.  If, for example, you know
that the input only consists of integers between 0 and 10, then
'new' will never become longer than 11 items, regardless of
whether the input is 20 or 20,000,000,000 items.  The runtime of
your function then suddenly becomes linear in the problem size
instead of quadratic.

Even so, maintaining the collection of already-seen items as a set
is likely to be faster than keeping it as a list, even for a
fairly small number of unique items.
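
I mean something along these lines if the order has to be
preserved (again just a sketch, with made-up names):

    def unique_keep_order(items):
        # Set-based version: the membership test is (on average)
        # constant time, so the total work is linear in the input.
        seen = set()    # for the "have we seen this?" test
        new = []        # keeps the items in first-seen order
        for item in items:
            if item not in seen:
                seen.add(item)
                new.append(item)
        return new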

One small exception to this, though.  If maintaining only one
data structure (the ordered list 'new' in your code) makes your
working set fit within the processor cache, while maintaining two
data structures (a list to keep the order, and a set for the
membership test) makes it *not* fit within the cache, then it
*might* actually be faster to just maintain the list.  Unlikely,
but not entirely impossible, and a small change in problem size
can tip the balance back again.


-- 
Thomas Bellman,   Lysator Computer Club,   Linköping University,  Sweden
"What sane person could live in this world   !  bellman @ lysator.liu.se
 and not be crazy?"  -- Ursula K LeGuin      !  Make Love -- Nicht Wahr!


