[Tutor] Simple way to compare Two lists

Fri Aug 10 14:17:50 CEST 2007

Hi,

You're really asking about optimisation incidentally.

On Friday 10 August 2007 10:54, Jaggo wrote:
> Hello!
>
> I desperately need a simple way to compare whether any item of SmallList is
> in BigList.

A simple way:

True in [x in BigList for x in SmallList]

Not efficient necessarily, but it is a simple way. Sounds like you want a 
faster way.

> My current way,
>
> def IsAPartOfList(SmallList,BigList)
> for item in SmallList:
>     if item in BigList:
>         return True
> return False

This is directly equivalent to my approach above BUT I *expect* your approach 
will run faster. *Since you short circuit the result by returning true.

> Takes up waay too much time to process.
> Can anyone think of any better way?

So what you're really after is a quicker way, yes?

The problem you have here is worst case IsAPartOfList will run through all the 
elements of BigList a number of times (specifically len(SmallList).

Sorting the two lists and then working through may be quicker, but I'm struck 
that the cost of sorting BigList itself may be more costly than you want 
given you say the above function you wrote (which is pretty efficient) is too 
slow.

Two immediate thoughts - use psyco on the function. It's a simple enough 
function and psyco may be able to give you a dramatic speedup.

Alternatively you could reverse your loop. Rather than do SmallList * BigList 
number of comparisons, if there are duplicates in BigList (rather than being 
a set), then you could iterate through BigList more efficiently.

If I read your spec correctly, then the following holds true even though its 
the inverse of what you stated as the probem:

def IsAPartOfList_2(SmallList,BigList)
  for item in BigList:
      if item in SmallList:
          return True
  return False

Incidentally this simple reversal may in itself be quicker depending on your 
data. If the numbers are generated from (say) a list of events and you're 
correlating events, then you may find this quicker:

def IsAPartOfList_3(SmallList,BigList)
  for item in reversed(BigList):
      if item in SmallList:
          return True
  return False

Since that exploits domain knowledge about the two lists. This becomes 
especially likely if the lists are likely to be already sorted because 
they've come from some ordered source (eg a server log).

Furthermore if BigList is a list of times all in ascending order, and 
SmallList is a list of times in ascending order, and the latter represents a 
list of "recent" events and BigList represents a list of all events, then you 
can make another optimsation to take advantage of that knowledge:

def IsAPartOfList_4(SmallList,BigList)
  for item in reversed(BigList):
      if item in reversed(SmallList):
          return True
  return False

Why?
Consider the following two lists:
BigList: 
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
   21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]

SmallList:
   [33, 34, 35, 36]

Then IsAPartOfList_4 will make 13 comparisons before returning True. Whereas 
IsAPartOfList_2 will make 133 comparisons, and your original will make 34 
comparisons.

However, if you have duplicates in BigList, we can skip the "if item in 
SmallList" every time we have a duplicate:

def IsAPartOfList(SmallList,BigList)
  seen = {}
  for item in BigList:
      if not seen.get(item, False):
         if item in SmallList:
             return True
  return False

This will only give you a potential speedup IF two things hold:
    * BigList contains duplicates
    * seen.get(item, False) is quicker 

IF these things don't hold then you won't see any improvement. Personally I'd 
try using psyco on the function since then you leave the function looking 
clean and simple.

Which is things is quickest will depend heavily on the likely structure of the 
data, likelihood of duplicate, ordering and likely similarities between data.

Overall though you have two choices:
   * Exploit your knowledge of the distribution and ordering of values
   * Use psyco

These aren't mutually exclusive options :-)

Michael.