Remove empty strings from list

Tue Sep 15 04:16:38 EDT 2009

Helvin wrote:
> Hi,
> 
> Sorry I did not want to bother the group, but I really do not
> understand this seeming trivial problem.
> I am reading from a textfile, where each line has 2 values, with
> spaces before and between the values.
> I would like to read in these values, but of course, I don't want the
> whitespaces between them.
> I have looked at documentation, and how strings and lists work, but I
> cannot understand the behaviour of the following:
>   line = f.readline()
>   line = line.lstrip() # take away whitespace at the beginning of the
> readline.

file.readline returns the line with its trailing newline character (which 
the str.strip method treats as whitespace), so you may want to use 
line.strip() instead of line.lstrip().
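
To make the difference concrete, here is a minimal sketch. The sample line 
is made up to look like the ones described in the question; it is not the 
poster's actual data:

```python
# Hypothetical line, shaped like the ones described in the question
line = "   44     0.000000000\n"

left_only = line.lstrip()   # leading spaces removed, trailing newline kept
both_ends = line.strip()    # whitespace removed at both ends

print(repr(left_only))
print(repr(both_ends))
```

With lstrip the newline stays attached to the last value, which is how 
'0.000000000\n' ends up in the list.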

>  list = line.split(' ')

Slightly OT, but: don't use the names of builtin types or functions as 
identifiers, since this shadows the builtin object (here, list).

Also, the default behaviour of str.split (called with no argument) is to 
split on runs of whitespace and to ignore leading and trailing whitespace, 
so it never produces empty strings. You would get better results by not 
specifying the delimiter here:

 >>> " a  a  a  a ".split(' ')
['', 'a', '', 'a', '', 'a', '', 'a', '']
 >>> " a  a  a  a ".split()
['a', 'a', 'a', 'a']
 >>>

> # the list has empty strings in it, so now,
> remove these empty strings

A problem you could have avoided right from the start !-)

>  for item in list:
>    if item is ' ':

Don't use identity comparison when you want to test for equality. It 
happens to more or less work in your example above, but only because 
CPython caches _some_ small strings. You should _never_ rely on such 
implementation details. A string containing accented characters would 
not have been cached:
 >>> s = 'ééé'
 >>> s is 'ééé'
False
 >>>

Also, this is surely not your actual code : ' ' is not an empty string, 
it's a string with a single space character. The empty string is ''. And 
FWIW, empty strings (like most empty sequences and collections, all 
numerical zeros, and the None object) have a false value in a boolean 
context, so you can just test the string directly:

for s in ['', 0, 0.0, [], {}, (), None]:
    if not s:
        print "'%s' is empty, so it's false" % str(s)
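
Applied to strings like the ones in the question, both points can be 
sketched as follows. The items list is made up for illustration; note that 
the single-space string is neither equal to '' nor false:

```python
# Made-up sample: two real values, one empty string, one single space
items = ['44', '', ' ', '0.000000000']

# Equality test: compares contents, never object identity
empty_by_equality = [item == '' for item in items]

# Truthiness test: an empty string is false, any other string is true
empty_by_truth = [not item for item in items]

print(empty_by_equality)
print(empty_by_truth)
```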

> 	print 'discard these: ',item
> 	index = list.index(item)
> 	del list[index]         # remove this item from the list

And then you do have a big problem : the internal pointer used by the 
iterator is not in sync with the list anymore, so the next iteration 
will skip one item.

As a general rule: *don't* add elements to or remove elements from a 
sequence while iterating over it. If you really need to modify the 
sequence while iterating over it, do a reverse iteration - but there are 
usually better solutions.
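
For completeness, a minimal sketch of the reverse-iteration approach, using 
a list like the one in the question (without the trailing newline). This is 
only to show why going backwards is safe, not a recommendation:

```python
alist = ['44', '', '', '', '', '', '0.000000000']

# Walk the indexes from last to first: deleting an item only shifts
# the items that come after it, and those have already been visited.
for i in range(len(alist) - 1, -1, -1):
    if not alist[i]:
        del alist[i]

print(alist)
```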

>    else:
> 	print 'keep this: ',item
> The problem is,

Make it a plural - there's more than 1 problem here !-)

> when my list is :  ['44', '', '', '', '', '',
> '0.000000000\n']
> The output is:
>     len of list:  7
>     keep this:  44
>     discard these:
>     discard these:
>     discard these:
> So finally the list is:   ['44', '', '', '0.000000000\n']
> The code above removes all the empty strings in the middle, all except
> two. My code seems to miss two of the empty strings.
> 
> Would you know why this is occuring?

cf above... and below:

 >>> alist = ['44', '', '', '', '', '', '0.000000000']
 >>> for i, it in enumerate(alist):
...     print 'i : %s -  it : "%s"' % (i, it)
...     if not it:
...         del alist[i]
...     print "alist is now %s" % alist
...
i : 0 -  it : "44"
alist is now ['44', '', '', '', '', '', '0.000000000']
i : 1 -  it : ""
alist is now ['44', '', '', '', '', '0.000000000']
i : 2 -  it : ""
alist is now ['44', '', '', '', '0.000000000']
i : 3 -  it : ""
alist is now ['44', '', '', '0.000000000']
 >>>

Ok, now for practical answers:

1/ in the above case, use line.strip().split() and you'll have no more 
problems !-)
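
A minimal sketch of that, assuming a line shaped like the ones in the 
question. The parse_line name and the sample line are hypothetical, chosen 
only for this example:

```python
def parse_line(line):
    """Return the whitespace-separated fields of a line as a list."""
    # strip() drops the trailing newline and the leading spaces,
    # split() with no argument splits on runs of whitespace.
    return line.strip().split()

fields = parse_line("   44     0.000000000\n")
print(fields)
```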

2/ as a general rule, if you need to filter a sequence, don't try to do 
it in place (unless it's a *very* big sequence and you run into memory 
problems, but then there are probably better solutions).

The common idioms for filtering a sequence are:

* filter(predicate, sequence):

the 'predicate' param is a callback function which takes an item from the 
sequence and returns a boolean value (True to keep the item, False to 
discard it). The following example will filter out even integers:

def is_odd(n):
    return n % 2

alist = range(10)
odds = filter(is_odd, alist)
print alist
print odds

Alternatively, filter() can take None as its first param, in which case 
it will filter out items that have a false value in a boolean context, i.e.:

alist = ['', 'a', 0, 1, [], [1], None, object, False, True]
result = filter(None, alist)
print result

* list comprehensions

Here you directly build the result list:

alist = range(10)
odds = [n for n in alist if n % 2]

alist = ['', 'a', 0, 1, [], [1], None, object, False, True]
result = [item for item in alist if item]
print result
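
Both idioms can be applied to the list from the question. This is a sketch 
only; the names stripped, cleaned and also_cleaned are made up here:

```python
fields = ['44', '', '', '', '', '', '0.000000000\n']

# Strip each item first, so the trailing newline goes away too
stripped = [item.strip() for item in fields]

# List comprehension: keep only the items that are non-empty
cleaned = [item for item in stripped if item]

# Same result with filter(None, ...); list() is needed on Python 3,
# where filter() returns an iterator instead of a list
also_cleaned = list(filter(None, stripped))

print(cleaned)
print(also_cleaned)
```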

HTH