[Tutor] Delete unwanted rows
Peter Otten
__peter__ at web.de
Wed Apr 9 16:14:01 CEST 2014
Alexis Prime wrote:
> Hello,
>
> My question is whether I should write a loop or a function to delete rows.
>
> I'm using pandas. But you may be able to help me as my question is about
> the reasoning behind programming.
>
> I have a pandas dataframe that looks like this, covering all countries in
> the world, for over 200 rows and many columns:
>
> Canada 20
> China 112
> Germany 10
> Japan 12
> Martinique 140
> Mexico 180
> Saint Kitts 90
> Saint Martins 133
> Saint Helena 166
> USA 18
>
> # So I write a list of small countries that I wish to exclude from my
> analysis. What I want to do is to delete the rows from my dataframe.
>
> toexclude = ['Martinique', 'Saint Kitts', 'Saint Martins', 'Saint
> Helena']
>
> After this, should I write a loop to loop through the dataframe, find the
> countries that I want to delete, and then delete the rows?
>
> Or should I write a function, which deletes those rows, and then returns
> me a new and trimmed dataframe?
>
> Thank you for helping me figure this out.
The dataset is so small that I would not spend much time on efficiency or
philosophical considerations, and would use the solution that is easiest to
code. In generic Python this typically means iterating over the data and
building a new list (I'm sorry, I don't know how this translates into pandas):
data = [
    ("Canada", 20),
    ("China", 112),
    ...
]
excluded = {"Martinique", "Saint Kitts", ...}
cleaned_data = [row for row in data if row[0] not in excluded]
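As for "loop or function": the two are not really alternatives, because the loop can live inside a function that returns a new, trimmed list. A minimal sketch of that idea; the function name without_excluded and the shortened sample data are only illustrative:

```python
def without_excluded(rows, excluded):
    """Return a new list with the rows whose first field is not in excluded."""
    excluded = set(excluded)  # accepts any iterable and gives fast lookups
    return [row for row in rows if row[0] not in excluded]


data = [
    ("Canada", 20),
    ("Martinique", 140),
    ("Mexico", 180),
    ("Saint Kitts", 90),
]
cleaned_data = without_excluded(data, ["Martinique", "Saint Kitts"])
print(cleaned_data)
```

The original list is left untouched, so the unfiltered data stays available for other uses.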
However, pandas is a specialist topic and if you expect to work more with it
you may want to learn the proper idiomatic way to do it. You should then
ask again on a mailing list that is frequented by the pandas experts --
python-tutor is mostly about the basics of generic Python.
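For reference, the idiomatic pandas route appears to be the vectorized membership test Series.isin(), inverted with the ~ operator to keep only the rows that are not listed. A minimal sketch, assuming the country sits in a column called "name"; the column names and the shortened sample are assumptions, not the original poster's actual layout:

```python
import pandas

df = pandas.DataFrame({
    "name": ["Canada", "Martinique", "Mexico", "Saint Kitts"],
    "value": [20, 140, 180, 90],
})
toexclude = ["Martinique", "Saint Kitts"]

# isin() yields a boolean Series; ~ inverts it, so only rows
# whose name is NOT in toexclude are kept.
cleaned_df = df[~df["name"].isin(toexclude)]
print(cleaned_df)
```

This builds a new dataframe and leaves df unchanged, much like the list comprehension in generic Python.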
Finally, as someone who knows Python well, just a little numpy, and nothing
about pandas, I decided to bang my head against the wall a few times until I
came up with the following hack:
import pandas
data = """\
Canada 20
China 112
Germany 10
Japan 12
Martinique 140
Mexico 180
Saint Kitts 90
Saint Martins 133
Saint Helena 166
USA 18
"""
rows = [line.rsplit(None, 1) for line in data.splitlines()]
names = [row[0] for row in rows]
values = [int(row[1]) for row in rows]
df = pandas.DataFrame(dict(name=names, value=values))
class Contain:
    def __init__(self, items):
        self.items = set(items)
    def __ne__(self, other):
        return other not in self.items
    def __eq__(self, other):
        return other in self.items
exclude = Contain([
    'Martinique',
    'Saint Kitts',
    'Saint Martins',
    'Saint Helena'])
cleaned_df = df[df.name != exclude]
print("Before:")
print(df)
print()
print("After:")
print(cleaned_df)
[The trick here is to express the "in" test through the "==" and "!="
operators, because those can return arbitrary objects while the result of
"in" is always converted to bool]
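The difference between the two operators can be seen without pandas. In the sketch below the class Elementwise is a hypothetical stand-in for an array-like type; it is not part of numpy or pandas. Python hands back whatever __eq__ returns, but converts the result of __contains__ to a bool:

```python
class Elementwise:
    """Hypothetical stand-in for an array type; compares item by item."""

    def __init__(self, items):
        self.items = list(items)

    def __eq__(self, other):
        # The return value of == is handed back to the caller unchanged.
        return [item == other for item in self.items]

    def __contains__(self, other):
        # Python calls bool() on whatever the "in" hook returns.
        return [item == other for item in self.items]


names = Elementwise(["Canada", "Martinique", "Mexico"])
eq_result = names == "Mexico"      # one flag per item
in_result = "Mexico" in names      # collapsed to a single bool
bogus_result = "Nowhere" in names  # also True: a non-empty list is truthy
print(eq_result, in_result, bogus_result)
```

Because "in" always collapses to a single bool, a per-row membership test has to be routed through an operator such as "==" or "!=", which is what the Contain class does.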