Removing None objects from a sequence
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Sat Dec 13 06:07:53 EST 2008
On Fri, 12 Dec 2008 19:02:24 -0500, Terry Reedy wrote:
> Tim Chase wrote:
>>> If you want to literally remove None objects from a list....(or
>>> mutable sequence)
>>>
>>> def deNone(alist):
>>> n=len(alist)
>>> i=j=0
>>> while i < n:
>>> if alist[i] is not None:
>>> alist[j] = alist[i]
>>> j += 1
>>> i += 1
>>> alist[j:i] = []
>>>
>>> blist=[None,1,None,2,None,3,None,None,4,None] deNone(blist)
>>> print(blist)
>>>
>>> # prints [1, 2, 3, 4]
>>
>> ...wouldn't a cleaner way of doing this just be
>>
>> >>> blist=[None,1,None,2,None,3,None,None,4,None]
>
> No, making a filtered copy that is then copied back before being deleted
> is algorithmically much messier. My code does the minimum work
> necessary and is algorithmically cleaner.
Er what? You're joking, right?
Your function has eight lines, all of which are very low level: you're
dealing with not one but TWO list indexes yourself. You have three
internal names to deal with, although in fairness one is set once then
used as a constant from then on. The algorithm is unclear: try explaining
what you are doing in English, and see how difficult it is.
"Keep two indexes into the list. If the first index isn't pointing at
None, copy that value to the second index, which is either the same as
the first index, or pointing at None. Then advance the second index, and
the first index, as needed. It will never over-write a non-None value,
but just try proving it."
Contrast that with the alternative suggested by Tim:
def deNone2(alist):
alist[:] = [x for x in alist if x is not None]
It's one line, with one internal variable. You don't have to manipulate
index variables. The explanation is simple:
"Make a new list with the non-None values and assign it in-place to the
old list."
Clear, succinct, and understandable, with no corner cases like empty
lists to worry about.
Here's another low-level algorithm, the classical delete items in place
algorithm. Three lines, one index, lousy O(N**2) performance.
def deNone3(alist):
for i in xrange(len(alist)-1, -1, -1):
if alist[i] is None:
del alist[i]
Now, let's do a shoot-out. Do they return the same thing?
>>> masterlist = [None]*2 + range(5) + [None]*3 + range(5, 10) + [None]*4
+ [10, None, 11, None, 12, None]
>>> alist = masterlist[:]
>>> blist = masterlist[:]
>>> clist = masterlist[:]
>>> deNone(alist)
>>> deNone2(blist)
>>> deNone3(clist)
>>> alist == blist == clist
True
Which is faster?
>>> from timeit import Timer
>>> setup = "from __main__ import deNone, deNone2, deNone3;" \
... " from __main__ import masterlist as m"
>>> t1 = Timer("a=m[:];deNone(a)", setup)
>>> t2 = Timer("a=m[:];deNone2(a)", setup)
>>> t3 = Timer("a=m[:];deNone3(a)", setup)
>>> t1.repeat(number=10000)
[0.39079904556274414, 0.38915109634399414, 0.39700794219970703]
>>> t2.repeat(number=10000)
[0.17854905128479004, 0.1782989501953125, 0.17886185646057129]
>>> t3.repeat(number=10000)
[0.26834988594055176, 0.25835609436035156, 0.25850009918212891]
Not surprisingly, Tim's version is twice as fast as yours. Surprisingly,
even the deNone3 function is faster than yours.
What about a real test? None of these piddly toy data sets, with 25
items. Something *real*.
>>> masterlist = masterlist*100
>>> masterlist += range(1000)
>>> masterlist += [None]*1000
>>> masterlist += [None, 0]*1000
>>> masterlist += [1]*10000
>>> len(masterlist)
16500
>>> t1.repeat(number=100)
[2.1611621379852295, 1.2539350986480713, 1.2424759864807129]
>>> t2.repeat(number=100)
[0.93860101699829102, 0.44704914093017578, 0.41285014152526855]
>>> t3.repeat(number=100)
[4.5643420219421387, 3.216562032699585, 3.2176508903503418]
Not surprisingly, my version really suffers. But so does yours: it's now
three times slower than Tim's. I expect that Tim's version will look
better and better until the list is so huge that memory paging to disk
becomes a factor.
Now, sure, most of the work in Tim's version is executed in fast C code
instead of slow Python code. If we were programming in C, your version
might perform better relative to Tim's. But even if performance was
better, I wouldn't describe the algorithm as particularly clean.
--
Steven.
More information about the Python-list
mailing list