[Tutor] weird lambda expression -- can someone help me understand how this works
Steven D'Aprano
steve at pearwood.info
Sat Dec 14 05:19:21 CET 2013
On Fri, Dec 13, 2013 at 09:14:12PM -0500, Michael Crawford wrote:
> I found this piece of code on github
>
> https://gist.github.com/kljensen/5452382
>
> def one_hot_dataframe(data, cols, replace=False):
>     """ Takes a dataframe and a list of columns that need to be encoded.
>         Returns a 3-tuple comprising the data, the vectorized data,
>         and the fitted vectorizor.
>     """
>     vec = DictVectorizer()
>     mkdict = lambda row: dict((col, row[col]) for col in cols) #<<<<<<<<<<<<<<<<<<
>     vecData = pandas.DataFrame(vec.fit_transform(data[cols].apply(mkdict, axis=1)).toarray())
>     vecData.columns = vec.get_feature_names()
>     vecData.index = data.index
>     if replace is True:
>         data = data.drop(cols, axis=1)
>         data = data.join(vecData)
>     return (data, vecData, vec)
>
> I don't understand how that lambda expression works.
Lambda is just syntactic sugar for a function. It is exactly the same as
a def function, except with two limitations:
- there is no name, or to be precise, the name of all lambda functions
is the same, "<lambda>";
- the body of the function is limited to exactly a single expression.
So we can take the lambda:
    lambda row: dict((col, row[col]) for col in cols)
give it a more useful name, and turn it into this:
    def mkdict(row):
        return dict((col, row[col]) for col in cols)
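To see that the two spellings really do behave the same, here is a small sketch. The sample values for cols and row are made up for illustration; they are not from the gist.

```python
# Hypothetical sample data, standing in for the real column names and row.
cols = ["colour", "size"]
row = {"colour": "red", "size": "L", "price": 10}

# The lambda from the gist, bound to a name.
mkdict_lambda = lambda row: dict((col, row[col]) for col in cols)

# The equivalent def version.
def mkdict(row):
    return dict((col, row[col]) for col in cols)

print(mkdict_lambda.__name__)              # <lambda>
print(mkdict.__name__)                     # mkdict
print(mkdict_lambda(row) == mkdict(row))   # True
```

The only visible difference is the function's name; the result of calling either one is the same dict.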
Now let's analyse that function. It takes a single argument, "row". That
means that when you call the function, you have to provide a value for
the row variable. To take a simpler example, when you call the len()
function, you have to provide a value to take the length of!
    len()
    => gives an error, because there's nothing to take the length of
    len("abcdef")
    => returns 6
Same here with mkdict. It needs to be given a row argument. That is the
responsibility of the caller, which we'll get to in a moment.
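As a rough illustration of the same point for mkdict, the sketch below calls it once without a row and once with one. To keep it self-contained, cols is hard-coded inside the function here, which is not how the gist does it, and the sample row is invented.

```python
def mkdict(row):
    # Assumption for this sketch only: cols is fixed here rather than
    # coming from one_hot_dataframe.
    cols = ["a", "b"]
    return dict((col, row[col]) for col in cols)

try:
    mkdict()                      # no row supplied by the caller
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

result = mkdict({"a": 1, "b": 2, "c": 3})   # the caller supplies the row
print(result)                               # {'a': 1, 'b': 2}
```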
mkdict also has two other variables:
- col, which is defined inside the function: it is a loop variable
created by the "for col in cols" part;
- cols, which is taken from the one_hot_dataframe argument of the
same name. Technically, this makes the mkdict function a so-called
"closure", but don't worry about that. You'll learn about closures
in due course.
[For pedants: technically, "dict" is also a variable, but that's not
really important as Python ends up using the built-in dict type for
that.]
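If you are curious about the closure part, here is a minimal sketch. The make_mkdict function is a hypothetical stand-in for one_hot_dataframe, stripped down to just the part that creates the lambda; the rows and column names are made up.

```python
def make_mkdict(cols):
    # cols is a local variable of make_mkdict. The lambda does not receive
    # it as an argument; it remembers it from the enclosing function.
    mkdict = lambda row: dict((col, row[col]) for col in cols)
    return mkdict

mkdict_ab = make_mkdict(["a", "b"])
mkdict_c = make_mkdict(["c"])

row = {"a": 1, "b": 2, "c": 3}
print(mkdict_ab(row))   # {'a': 1, 'b': 2}
print(mkdict_c(row))    # {'c': 3}

# The names a function picks up from an enclosing function are listed here;
# for this lambda the list includes 'cols'.
print(mkdict_ab.__code__.co_freevars)
```

Each lambda keeps hold of the cols it was created with, even after make_mkdict has returned.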
> For starters where did row come from?
> How did it know it was working on data?
To answer these questions, we have to look at the next line, where the
mkdict function is actually used:
    vecData = pandas.DataFrame(
        vec.fit_transform(
            data[cols].apply(mkdict, axis=1)
        ).toarray()
    )
I've spread that line over multiple physical lines to make it easier to
read. The first thing you'll notice is that it does a lot of work in a
single call: it calls DataFrame, fit_transform, toarray, whatever they
are. But the critical part for your question is the middle part:
    data[cols].apply(mkdict, axis=1)
This extracts data[cols] (whatever that gives!) and then calls the
"apply" method on it. I don't know what that actually is, I've never
used pandas, but judging by the name I can guess that "apply" takes some
sort of array of values:
     1  2  3  4  5  6
     7  8  9 10 11 12
    13 14 15 16 17 18 ...
extracts out either the rows (axis=1) or columns (axis=0 or axis=2
perhaps?), and feeds them to a callback function. In this case, the
callback function will be mkdict.
So, and remember this is just my guess based on the name, the apply
method does something like this:
- extract row 1, giving [1, 2, 3] (or whatever the values happen
to be);
- pass that row to mkdict, giving mkdict([1, 2, 3]) which
calculates a dict {blah blah blah};
- stuff that resulting dict somewhere for later use;
- do the same for row 2, then row 3, and so on.
That's my expectation.
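The steps above can be sketched in plain Python, without pandas. The apply_rows function is a hypothetical stand-in for what apply(mkdict, axis=1) is guessed to do, not the real pandas implementation, and the table contents are invented. One assumption worth stating: for row[col] to work, each row has to be something you can index by column name, so plain dicts play that role here.

```python
def apply_rows(table, func):
    """Hypothetical stand-in for apply(func, axis=1): call func once per row."""
    results = []
    for row in table:               # extract row 1, then row 2, and so on
        results.append(func(row))   # pass the row to the callback, keep the result
    return results

cols = ["a", "b"]
mkdict = lambda row: dict((col, row[col]) for col in cols)

# Each row is a dict keyed by column name (an assumption of this sketch).
table = [
    {"a": 1, "b": 2},
    {"a": 7, "b": 8},
]

dicts = apply_rows(table, mkdict)
print(dicts)   # [{'a': 1, 'b': 2}, {'a': 7, 'b': 8}]
```

So "row" is never defined by you: whatever calls the callback supplies it, one row at a time.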
--
Steven