[Tutor] weird lambda expression -- can someone help me understand how this works

Sat Dec 14 05:19:21 CET 2013

On Fri, Dec 13, 2013 at 09:14:12PM -0500, Michael Crawford wrote:
> I found this piece of code on github
> 
> https://gist.github.com/kljensen/5452382
> 
> def one_hot_dataframe(data, cols, replace=False):
>     """ Takes a dataframe and a list of columns that need to be encoded.
>         Returns a 3-tuple comprising the data, the vectorized data,
>         and the fitted vectorizor.
>     """
>     vec = DictVectorizer()
>     mkdict = lambda row: dict((col, row[col]) for col in cols)  #<<<<<<<<<<<<<<<<<<
>     vecData = pandas.DataFrame(vec.fit_transform(data[cols].apply(mkdict, axis=1)).toarray())
>     vecData.columns = vec.get_feature_names()
>     vecData.index = data.index
>     if replace is True:
>         data = data.drop(cols, axis=1)
>         data = data.join(vecData)
>     return (data, vecData, vec)
> 
> I don't understand how that lambda expression works.

Lambda is just syntactic sugar for a function. It is exactly the same as 
a def function, except with two limitations:

- there is no name, or to be precise, the name of all lambda functions 
is the same, "<lambda>";

- the body of the function is limited to exactly a single expression.

So we can take the lambda:

lambda row: dict((col, row[col]) for col in cols)

give it a more useful name, and turn it into this:

def mkdict(row):
    return dict((col, row[col]) for col in cols)
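
A quick way to convince yourself that the two forms behave alike is to 
build both and compare them. This is only a sketch: the names 
mkdict_lambda and mkdict_def, the column names and the sample row are 
made up for illustration, and a plain dict stands in for whatever a real 
row would be.

```python
# Made-up column names and a made-up row; a plain dict stands in
# for the real row object.
cols = ["colour", "size"]
row = {"colour": "red", "size": "L", "price": 10}

mkdict_lambda = lambda row: dict((col, row[col]) for col in cols)

def mkdict_def(row):
    return dict((col, row[col]) for col in cols)

# Both give the same result for the same input.
print(mkdict_lambda(row) == mkdict_def(row))   # True

# The only visible difference is the function's name.
print(mkdict_lambda.__name__)   # <lambda>
print(mkdict_def.__name__)      # mkdict_def
```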

Now let's analyse that function. It takes a single argument, "row". That 
means that when you call the function, you have to provide a value for 
the row variable. To take a simpler example, when you call the len() 
function, you have to provide a value to take the length of!

len()
=> gives an error, because there's nothing to take the length of

len("abcdef")
=> returns 6

Same here with mkdict. It needs to be given a row argument. That is the 
responsibility of the caller, which we'll get to in a moment.
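
To see the same thing with mkdict itself, here is a small sketch. The 
cols list and the sample row are invented for the example, and the row 
is again just a plain dict.

```python
# Invented column names for this sketch.
cols = ["a", "b"]

def mkdict(row):
    return dict((col, row[col]) for col in cols)

# Calling it with no argument fails, just as len() does.
try:
    mkdict()
except TypeError as err:
    print("TypeError:", err)

# Calling it with a row works; only the columns named in cols are kept.
print(mkdict({"a": 1, "b": 2, "c": 3}))
```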

mkdict also has two other variables:

- col, which is defined inside the function: it is a loop variable 
  created by the "for col in cols" part;

- cols, which is taken from the one_hot_dataframe argument of the
  same name. Technically, this makes the mkdict function a so-called
  "closure", but don't worry about that. You'll learn about closures
  in due course.

[For pedants: technically, "dict" is also a variable, but that's not 
really important as Python ends up using the built-in dict function for 
that.]
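
If you are curious how mkdict gets hold of cols, the following sketch 
shows the idea on a small scale. The helper make_mkdict is hypothetical: 
it only plays the part of one_hot_dataframe, whose cols argument is 
remembered by the function defined inside it.

```python
def make_mkdict(cols):
    # Hypothetical helper standing in for one_hot_dataframe: the inner
    # function remembers whatever "cols" was passed in here.
    def mkdict(row):
        return dict((col, row[col]) for col in cols)
    return mkdict

pick_ab = make_mkdict(["a", "b"])
pick_c = make_mkdict(["c"])

row = {"a": 1, "b": 2, "c": 3}
print(pick_ab(row))   # uses the cols it was created with
print(pick_c(row))    # a different cols, remembered separately

# The remembered value is stored in the function's closure.
print(pick_ab.__closure__[0].cell_contents)
```

Each inner function carries its own cols with it, which is all that 
"closure" means here.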

> For starters where did row come from?  
> How did it know it was working on data?

To answer these questions, we have to look at the next line, where the 
mkdict function is actually used:

vecData = pandas.DataFrame(
            vec.fit_transform(
              data[cols].apply(mkdict, axis=1)
              ).toarray()
          )

I've spread that line over multiple physical lines to make it easier to 
read. The first thing you'll notice is that it does a lot of work in a 
single call: it calls DataFrame, fit_transform, toarray, whatever they 
are. But the critical part for your question is the middle part:

data[cols].apply(mkdict, axis=1)

this extracts data[cols] (whatever that gives!) and then calls the 
"apply" method on it. I don't know what that actually is, I've never 
used pandas, but judging by the name I can guess that "apply" takes some 
sort of array of values:

    1  2  3  4  5  6
    7  8  9  10 11 12
    13 14 15 16 17 18 ...

extracts either the rows (axis=1) or the columns (axis=0, presumably), 
and feeds them one at a time to a callback function. In this case, the 
callback function will be mkdict.

So, and remember this is just my guess based on the name, the apply 
method does something like this:

- extract row 1, giving [1, 2, 3] (or whatever the values happen 
  to be);

- pass that row to mkdict, giving mkdict([1, 2, 3]) which 
  calculates a dict {blah blah blah};

- stuff that resulting dict somewhere for later use;

- do the same for row 2, then row 3, and so on.

That's my expectation.
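
In plain Python, the steps I'm guessing at would look roughly like the 
sketch below. The function apply_rows is hypothetical and written only 
to illustrate the guess; it is not how pandas implements apply, and the 
table here is just a list of dicts with made-up values.

```python
def apply_rows(table, callback):
    # Hypothetical stand-in for the guessed behaviour of "apply" with
    # axis=1; this is not pandas code.
    results = []
    for row in table:                   # take each row in turn
        results.append(callback(row))   # pass it to the callback, keep the result
    return results

cols = ["a", "b"]
mkdict = lambda row: dict((col, row[col]) for col in cols)

# Made-up data: each row is a plain dict keyed by column name.
table = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 7, "b": 8, "c": 9},
]
print(apply_rows(table, mkdict))
```

That is where "row" comes from: the caller (here apply_rows, in the 
original code the apply method) supplies it each time it calls mkdict.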

-- 
Steven