how to remove 50000 elements from a 100000 list?

Fri May 5 10:12:14 EDT 2006

Ju Hui wrote:
> I want to remove about 50000 elements from a list,which has 100000
> elements.
> sample code like below:
> 
>>>> a=range(10)
>>>> b=range(4)
>>>> for x in b:
> ...     a.remove(x)
> ...
>>>> a
> [4, 5, 6, 7, 8, 9]
> 
> when a and b is small size, it will finished quickly, but when a and b
> have many elements.
> such as:
>>>> a=range(100000)
>>>> b=range(50000)
>>>> for x in b:
> ...     a.remove(x)
> ...
> it will very slowly. Shall I change to another data structure and choos
> a better arithmetic?
> any suggestion is welcome.
> thanks a lot!
> 
A list isn't going to respond very well to random removals of large
numbers of elements like this.  Remember that it must do linear search
to find the value to be removed and then basically build a completely
new list with the remaining values.  Depending on the data in question,
you might be able to leverage things.  Are the two lists sorted?  If so
you can iterate over both of them and build a third list with the results.

This is not particularly 'elegant' but it is fast for sorted lists:

import time
a=range(100000)
b=range(50000)

start_time=time.time()
for x in b:
    a.remove(x)

stop_time=time.time()
print "brute force elapsed time=%.2f seconds" % (stop_time-start_time)

start_time=time.time()
n=[]
i1=0
i2=0
while 1:
    try: v1=a[i1]
    except IndexError:
        break

    try: v2=b[i2]
    except IndexError:
        n.extend(a[i1:])
        break

    if v1 < v2:
        n.append(v1)
        i1+=1
        continue

    elif v1 > v2:
        i2+=1
        continue

    else:
        i1+=1
        i2+=1
        continue

stop_time=time.time()
print "new elapsed time=%.2f seconds" % (stop_time-start_time)

start_time=time.time()
a=set(a)
b=set(b)
a.symmetric_difference(b)
stop_time=time.time()
print "sets elapsed time=%.2f seconds" % (stop_time-start_time)

brute force elapsed time=4.13 seconds
new elapsed time=0.05 seconds
sets elapsed time=0.03 seconds

There may be other ways using bisect or some other module.  If data
is random, unsorted or has duplicates, I would look at in-memory
database like SQLlite instead.

If the values are unique you can do something like:

a=set(range(100000))
b=set(range(50000))

a.symmetric_difference(b)

Which is the fastest way.

It all depends on the "real" data which we can't tell from your
test using range().

-Larry Bates