[Tutor] Delete unwanted rows

Wed Apr 9 16:14:01 CEST 2014

Alexis Prime wrote:

> Hello,
> 
> My question is whether I should write a loop or a function to delete rows.
> 
> I'm using pandas. But you may be able to help me as my question is about
> the reasoning behind programming.
> 
> I have a pandas dataframe that looks like this, covering all countries in
> the world, for over 200 rows and many columns:
> 
> Canada                20
> China                  112
> Germany             10
> Japan                  12
> Martinique             140
> Mexico                180
> Saint Kitts            90
> Saint Martins        133
> Saint Helena         166
> USA                    18
> 
> # So I write a list of small countries that I wish to exclude from my
> analysis. What I want to do is to delete the rows from my dataframe.
> 
>     toexclude = ['Martinique', 'Saint Kitts', 'Saint Martins',
>                  'Saint Helena']
> 
> After this, should I write a loop to loop through the dataframe, find the
> countries that I want to delete, and then delete the rows?
> 
> Or should I write a function, which deletes those rows, and then returns
> me a new and trimmed dataframe?
> 
> Thank you for helping me figure this out.

The dataset is so small that I would not spend much time on efficiency or
philosophical considerations, and would simply use the solution that is
easiest to code. In generic Python this typically means iterating over the
data and building a new list (I'm sorry, I don't know how this translates
into pandas):

data = [
   ("Canada", 20),
   ("China", 112),
   ...
]
excluded = {"Martinique", "Saint Kitts", ...}
cleaned_data = [row for row in data if row[0] not in excluded]
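
To come back to the loop-versus-function question: the two are not really
alternatives, because the loop (here a list comprehension) can live inside a
function that returns the trimmed data. Below is a minimal runnable sketch of
that, using the ten sample rows from the question. The name drop_excluded is
made up for illustration.

```python
def drop_excluded(rows, excluded):
    """Return a new list without the rows whose first field is in excluded."""
    excluded = set(excluded)  # a set makes the membership test cheap
    return [row for row in rows if row[0] not in excluded]

# Sample rows copied from the question.
data = [
    ("Canada", 20),
    ("China", 112),
    ("Germany", 10),
    ("Japan", 12),
    ("Martinique", 140),
    ("Mexico", 180),
    ("Saint Kitts", 90),
    ("Saint Martins", 133),
    ("Saint Helena", 166),
    ("USA", 18),
]
toexclude = ["Martinique", "Saint Kitts", "Saint Martins", "Saint Helena"]

cleaned_data = drop_excluded(data, toexclude)
print(cleaned_data)
```

The original list is left untouched, so it is easy to compare before and
after, or to rerun with a different exclusion list.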

However, pandas is a specialist topic, and if you expect to work more with it
you may want to learn the proper idiomatic way to do this. In that case you
should ask again on a mailing list that is frequented by pandas experts --
python-tutor is mostly about the basics of generic Python.
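
As a starting point for that question, the method to ask about is probably
Series.isin(), which tests every element of a column for membership and
returns a boolean Series. The sketch below is only that, a sketch: the column
names "name" and "value" are assumptions, and the frame holds a subset of the
rows from the question.

```python
import pandas

# Hypothetical two-column frame; "name" and "value" are assumed column names.
df = pandas.DataFrame({
    "name": ["Canada", "Martinique", "Mexico", "Saint Kitts", "USA"],
    "value": [20, 140, 180, 90, 18],
})
toexclude = ["Martinique", "Saint Kitts", "Saint Martins", "Saint Helena"]

# isin() gives a boolean Series; ~ inverts it so excluded rows become False.
keep = ~df["name"].isin(toexclude)
cleaned_df = df[keep]
print(cleaned_df)
```

Indexing with the boolean Series builds a new dataframe and leaves df as it
was.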

Finally, as someone who knows Python well, just a little numpy, and nothing
about pandas, I decided to bang my head against the wall a few times until I
came up with the following hack:

import pandas
data = """\
Canada                20
China                  112
Germany             10
Japan                  12
Martinique             140
Mexico                180
Saint Kitts            90
Saint Martins        133
Saint Helena         166
USA                    18
"""
rows = [line.rsplit(None, 1) for line in data.splitlines()]
names = [row[0] for row in rows]
values = [int(row[1]) for row in rows]

df = pandas.DataFrame(dict(name=names, value=values))

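class Contain:
    """Wrap a set so that == and != act as membership tests."""
    def __init__(self, items):
        self.items = set(items)
    def __ne__(self, other):
        # value != Contain(...) is True when value is not in the set
        return other not in self.items
    def __eq__(self, other):
        # value == Contain(...) is True when value is in the set
        return other in self.items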

exclude = Contain([
        'Martinique',
        'Saint Kitts',
        'Saint Martins',
        'Saint Helena'])

cleaned_df = df[df.name != exclude]

print("Before:")
print(df)
print()
print("After:")
print(cleaned_df)

[The trick here is to express the "in" test through the "==" and "!="
operators, because those can return arbitrary objects, such as the boolean
Series needed for indexing, while "in" always returns a bool.]
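
The difference between the two operators can be seen in plain Python, without
pandas. In this toy sketch (the Demo class is made up for illustration) both
special methods return the same list, but only "==" hands it back unchanged:

```python
class Demo:
    """Toy class whose special methods return a list instead of a bool."""
    def __contains__(self, item):
        return ["not", "a", "bool"]
    def __eq__(self, other):
        return ["not", "a", "bool"]

d = Demo()
in_result = 1 in d    # the non-empty list is converted to True
eq_result = d == 1    # the list is passed through unchanged

print(type(in_result).__name__)
print(type(eq_result).__name__)
```

This is why an element-wise comparison can produce a whole Series of results,
while an "in" test can only ever answer with a single True or False.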