Most efficient solution?

John Machin machin_john_888 at hotmail.com
Mon Jul 16 18:01:02 EDT 2001


Jay Parlar <jparlar at home.com> wrote in message news:<mailman.995289809.28477.python-list at python.org>...
> I have a simple problem, but one that, without some optimization, might hurt the performance of my project. To sum it up, I 
> have two lists, we'll call them list A and list B, both lists contain only one-word strings (ie. both were generated by string.split() 
> performed on a large amount of text)
> 
> List B consists of my "stopwords", meaning, the words I don't want included in my final version of list A. So what I need to 
> do is check every item in list A, and if it occurs in list B, then I want to remove it from the final version of A. My first thought 
> would be:
> 
> for eachItem in A:
>     if eachItem in B:
>         A.remove(eachItem)
> 
> Now, this will work fine, however, the size of list A varies, and while list B is constant, it is quite large (more than 750 items). 
> List A can also, should the situation warrant, become quite huge as well. The main problem is that I have no idea of the 
> efficiency of "in" for large lists like this. Can anyone think of a possibly quicker way, or is this the best?

A Python list is a sequence, so the "in" and "remove" operations must
scan it linearly: the "eachItem in B" test is O(m) and the
"A.remove(eachItem)" call is O(n). Thus what you propose will be
O(n*(m+n)), where n = len(A) and m = len(B), and for non-trivial sizes
you may well have a problem.
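To put rough, purely illustrative numbers on that: with m = 750
stopwords and, say, n = 100,000 words in A, n*(m+n) is on the order of
10**10 elementary comparisons, whereas a dictionary-based approach
(see below) needs only about 100,000 O(1) hash probes.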

This question seems to have little to do with Python and lots to do
with (a) data structures and (b) common sense and (c) looking at prior
work in the field.

Some clues:

(a) (1) B should not be a sequence-like data structure. In whatever
language you use, try a dictionary aka hashed look-up table. If the
language doesn't have such an ADT built in, or available in a standard
library, or it's not fast and robust, abandon the language [Don't
panic -- Python is just fine].

(2) Iterating over a collection from which you are removing items
needs to be undertaken with care.

(3) Don't forget to concern yourself about how much memory your data
structures will occupy. Your project could run very fast until it runs
out of real memory and then become painfully slow. Hint: at least one
of the solutions proposed by other posters involved a copy of your A
list.
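To make (1) and (2) concrete, here is a minimal sketch -- the sample
data and variable names are only illustrative, and "word not in stop"
is the same test as stop.has_key(word):

   # Build a hashed look-up table from the stopword list B, once.
   B = ["the", "a", "of", "and"]      # illustrative stopwords
   stop = {}
   for word in B:
       stop[word] = 1                 # only the keys matter

   A = "the cat sat on a mat".split()

   # Each membership test is a hash probe, not a linear scan of B, and
   # building a new list side-steps the remove-while-iterating trap.
   filtered = [word for word in A if word not in stop]
   print(filtered)                    # -> ['cat', 'sat', 'on', 'mat']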

(b) Why are you putting noise words into A and then ripping them out
again? To quote Jon Bentley: "Do nothing gracefully". Even more
graceful than do(nothing) is do_not_do(anything)! Consider this:

Here is the loop that you didn't mention:
   for word in A_source: # or something similar ...
      A.append(word)
Try changing that to:
   for word in A_source:
      if not B.has_key(word):   # B is now a dictionary keyed by the stopwords
         A.append(word)
   cue_the_fat_lady()
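
Pulling (a) and (b) together, a toy end-to-end version might look
like the following (the function names and sample text are mine, not
from your project):

   def load_stopwords(text):
       # Return a dictionary whose keys are the stopwords in `text`.
       stop = {}
       for word in text.split():
           stop[word] = 1
       return stop

   def collect_words(text, stop):
       # Split `text`, keeping only words that are not stopwords.
       # The noise words never get into the result, so nothing has to
       # be ripped out later.
       result = []
       for word in text.split():
           if word not in stop:       # O(1) hash probe per word
               result.append(word)
       return result

   stop = load_stopwords("the a of and")            # built once
   A = collect_words("the cat sat on a mat", stop)  # filtered as it is built
   print(A)                                         # -> ['cat', 'sat', 'on', 'mat']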

(c) Your school library should have copies of things like:
 
W. B. Frakes and R. Baeza-Yates, eds., Information Retrieval: Data
Structures and Algorithms, Englewood Cliffs, N.J.: Prentice-Hall.

You could also try dipping into a few of the 150 papers that cite
it; see
    http://citeseer.nj.nec.com/context/36679/0

If you are worried about how your project will be affected by len(A),
there is a book called "Managing Gigabytes".

[snip] 

> "Though there are many paths
> At the foot of the mountain
> All those who reach the top
> See the same moon."

Those using an O(n**3) mountain_top_search algorithm may find the moon
crumbled with age when and if they get to see it :-)


