NEWBIE QUESTIION: Comparing Lists in Python

Mark McEahern marklists at mceahern.com
Thu May 2 09:45:38 EDT 2002


[Alex Martelli]
> The way to get O(N) behavior is to construct auxiliary dictionaries:
>
> dict1 = dict(zip(list1,list1))
> returnList2 = [x for x in list2 if x not in dict1]
>
> "x not in C" takes O(N) time if container C is a sequence with N
> elements, but (roughly) O(1) time if C is a dictionary.  Building a
> dictionary of N items is roughly O(N).

Alex, thanks for the zip tip.  I don't usually reach for that particular
hammer--this usage broadens my horizons.

I whipped together the program below to time the difference between a raw
compare and a compare with an auxiliary dictionary (in the program I
contrast these two approaches as raw and zip).  It seems that with a small
list size (10-100), the performance boost of the zip is negligible?

Here are some results:

Usage: time_not_in.py count share.

count is the size of each list.
share is the number of elements they share.

$ time_not_in.py 100 10
raw: 0.004 seconds.
zip: 0.001 seconds.

$ time_not_in.py 1000 100
raw: 0.394 seconds.
zip: 0.010 seconds.

$ time_not_in.py 10000 1000
raw: 44.949 seconds.
zip: 0.176 seconds.

Cheers,

// mark

#! /usr/bin/env python

import time

def not_in_list_zip(l1, l2):
  """
  Return a list of elements from l1 that are not in l2.
  """
  d = dict(zip(l2, l2))
  return [x for x in l1 if x not in d]

def not_in_list_raw(l1, l2):
  """
  Return a list of elements from l1 that are not in l2.
  """
  return [x for x in l1 if x not in l2]

def time_raw(l1, l2):
    f = not_in_list_raw
    return time_it(f, l1, l2)

def time_zip(l1, l2):
    f = not_in_list_zip
    return time_it(f, l1, l2)

def time_it(f, l1, l2):
    start = time.time()
    d1 = f(l1, l2)
    d2 = f(l2, l1)
    end = time.time()
    return d1, d2, end - start

def show_time(s, t):
    print "%s: %1.3f seconds." % (s, t)

def main():
    import os, sys
    prog = os.path.basename(sys.argv[0])
    usage = "Usage: %s count share." % prog
    default_count = 10
    default_share = 2
    try:
        count = int(sys.argv[1])
    except IndexError:
        count = default_count
    try:
        share = int(sys.argv[2])
    except IndexError:
        share = default_share
    l1 = range(0, count)
    l2 = range(count - share, 2 * count - share)
    raw_d1, raw_d2, raw_t = time_raw(l1, l2)
    zip_d1, zip_d2, zip_t = time_zip(l1, l2)
    show_time("raw", raw_t)
    show_time("zip", zip_t)

if __name__ == "__main__":
    main()






More information about the Python-list mailing list