Removing None objects from a sequence

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Dec 13 06:07:53 EST 2008


On Fri, 12 Dec 2008 19:02:24 -0500, Terry Reedy wrote:

> Tim Chase wrote:
>>> If you want to literally remove None objects from a list....(or
>>> mutable sequence)
>>>
>>> def deNone(alist):
>>>    n=len(alist)
>>>    i=j=0
>>>    while i < n:
>>>      if alist[i] is not None:
>>>        alist[j] = alist[i]
>>>        j += 1
>>>      i += 1
>>>    alist[j:i] = []
>>>
>>> blist=[None,1,None,2,None,3,None,None,4,None] deNone(blist)
>>> print(blist)
>>>
>>> # prints [1, 2, 3, 4]
>> 
>> ...wouldn't a cleaner way of doing this just be
>> 
>>   >>> blist=[None,1,None,2,None,3,None,None,4,None]
> 
> No, making a filtered copy that is then copied back before being deleted
> is algorithmically much messier.  My code does the minimum work
> necessary and is algorithmically cleaner.

Er what? You're joking, right?

Your function has eight lines, all of which are very low level: you're 
dealing with not one but TWO list indexes yourself. You have three 
internal names to deal with, although in fairness one is set once then 
used as a constant from then on. The algorithm is unclear: try explaining 
what you are doing in English, and see how difficult it is.

"Keep two indexes into the list. If the first index isn't pointing at 
None, copy that value to the second index, which is either the same as 
the first index, or pointing at None. Then advance the second index, and 
the first index, as needed. It will never over-write a non-None value, 
but just try proving it."

Contrast that with the alternative suggested by Tim:

def deNone2(alist):
    alist[:] = [x for x in alist if x is not None]


It's one line, with one internal variable. You don't have to manipulate 
index variables. The explanation is simple:

"Make a new list with the non-None values and assign it in-place to the 
old list."

Clear, succinct, and understandable, with no corner cases like empty 
lists to worry about.


Here's another low-level algorithm, the classical delete items in place 
algorithm. Three lines, one index, lousy O(N**2) performance.

def deNone3(alist):
    for i in xrange(len(alist)-1, -1, -1):
        if alist[i] is None:
            del alist[i]

Now, let's do a shoot-out. Do they return the same thing?

>>> masterlist = [None]*2 + range(5) + [None]*3 + range(5, 10) + [None]*4 
+ [10, None, 11, None, 12, None]
>>> alist = masterlist[:]
>>> blist = masterlist[:]
>>> clist = masterlist[:]
>>> deNone(alist)
>>> deNone2(blist)
>>> deNone3(clist)
>>> alist == blist == clist
True


Which is faster?

>>> from timeit import Timer
>>> setup = "from __main__ import deNone, deNone2, deNone3;" \
... " from __main__ import masterlist as m"
>>> t1 = Timer("a=m[:];deNone(a)", setup)
>>> t2 = Timer("a=m[:];deNone2(a)", setup)
>>> t3 = Timer("a=m[:];deNone3(a)", setup)
>>> t1.repeat(number=10000)
[0.39079904556274414, 0.38915109634399414, 0.39700794219970703]
>>> t2.repeat(number=10000)
[0.17854905128479004, 0.1782989501953125, 0.17886185646057129]
>>> t3.repeat(number=10000)
[0.26834988594055176, 0.25835609436035156, 0.25850009918212891]


Not surprisingly, Tim's version is twice as fast as yours. Surprisingly, 
even the deNone3 function is faster than yours.

What about a real test? None of these piddly toy data sets, with 25 
items. Something *real*.

>>> masterlist = masterlist*100
>>> masterlist += range(1000)
>>> masterlist += [None]*1000
>>> masterlist += [None, 0]*1000
>>> masterlist += [1]*10000
>>> len(masterlist)
16500
>>> t1.repeat(number=100)
[2.1611621379852295, 1.2539350986480713, 1.2424759864807129]
>>> t2.repeat(number=100)
[0.93860101699829102, 0.44704914093017578, 0.41285014152526855]
>>> t3.repeat(number=100)
[4.5643420219421387, 3.216562032699585, 3.2176508903503418]

Not surprisingly, my version really suffers. But so does yours: it's now 
three times slower than Tim's. I expect that Tim's version will look 
better and better until the list is so huge that memory paging to disk 
becomes a factor.

Now, sure, most of the work in Tim's version is executed in fast C code 
instead of slow Python code. If we were programming in C, your version 
might perform better relative to Tim's. But even if performance was 
better, I wouldn't describe the algorithm as particularly clean. 


-- 
Steven.



More information about the Python-list mailing list