[Tutor] decomposing a problem

Avi Gross avigross at verizon.net
Fri Dec 28 22:39:53 EST 2018


I will answer this question then head off on vacation.

I said that "I" have lots of experience dealing with data in data frames.
Most of it is not in Python, obviously. They work well for some kinds of
data, not others. Without diverging from the purpose of the group, let me
just say a version of this structure is commonly used in the Python module
pandas. Python has been extended in many ways, including multi-dimensional
arrays in numpy. There are also matrix types of two or more dimensions,
but all of the above can hold only one kind of data at a time. In real
life, you have tabular data where each column can be of a different type.
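To make that concrete, here is a small sketch (toy names, my own invention) showing that a numpy array coerces everything to a single type, while a pandas data frame keeps a separate type per column:

```python
import numpy as np
import pandas as pd

# A NumPy array holds ONE dtype; mixing strings and ints
# forces everything to a common (string) type.
arr = np.array([["spam", 25], ["ham", 2]])
assert arr.dtype.kind == "U"  # all elements became Unicode strings

# A data frame keeps a type per column.
df = pd.DataFrame({"FOOD": ["spam", "ham"], "AMOUNT": [25, 2]})
print(df.dtypes)  # FOOD is object (strings), AMOUNT is int64
```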

There are quite a few implementations out there. Yes, your pointer to R
shows a basic version, but I note Python has its own versions too.
Implementations often include nice tools for taking subsets of rows or
columns, modifying existing structures, adding rows/columns, iterating
over them, and generating all kinds of logical subsets. The implementation
may be, in effect, a list of vectors of the same length. That gives lots
of flexibility if your data fits, not so much otherwise. It is one serious
reason R was used by so many people for purposes like statistics and
graphics, but Python has been extended, and now more and more people are
using it for such things, as well as for the many other things R is not
designed for.

Just one point, since it may be relevant. You can index a data.frame in
ways not that different from objects in Python. One way is to write a
condition like "variable > 5" (or something much more complex) that
generates a Boolean vector. That vector can be used to copy a subset of
rows for one purpose, such as training in machine learning, and simply
negating the vector gets the remaining rows. You can do similar things in
Python. My recent expertise and experience relied heavily on the data
structures R uses, where everything is a vector and most operations are
vectorized. In Python, I am learning to do things in whatever way works,
and that has been interesting too.
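In pandas terms, that Boolean-vector split might look like this (a minimal sketch with made-up toy data):

```python
import pandas as pd

# Toy data frame; the names and values are just for illustration.
df = pd.DataFrame({"FOOD": ["spam", "ham", "eggs", "brie"],
                   "AMOUNT": [25, 2, 7, 14]})

# A condition produces a Boolean vector...
mask = df["AMOUNT"] > 5

# ...which selects one subset of rows; negating it (~) gets the rest.
training = df[mask]
reserved = df[~mask]

print(training["FOOD"].tolist())  # ['spam', 'eggs', 'brie']
print(reserved["FOOD"].tolist())  # ['ham']
```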

You asked about the dictionary. R has things that are sort of similar,
but not quite. It has environments, which most people would not use as a
dictionary. So in R, I might have made a data.frame with key/value
columns, albeit the keys would not be forced to be unique. Selecting a
random key would then just mean picking a random number in the range of
the number of rows and taking the item in that row of the key column.
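A minimal Python sketch of that idea, again with invented toy data:

```python
import random
import pandas as pd

# Key/value columns held in a data frame (keys need not be unique).
kv = pd.DataFrame({"key": ["spam", "ham", "eggs"],
                   "value": [25, 2, 7]})

# Pick a random row number in range, then take the key in that row.
row = random.randrange(len(kv))
random_key = kv.iloc[row]["key"]
print(random_key)  # one of 'spam', 'ham', 'eggs'
```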

What you call a data.frame is not a data.frame. It is a list of two
lists. If you will pardon my putting Python code here, it might look like
the following; note that numpy, pandas, and other mathematical/scientific
modules have plenty of methods available. There is no real need for me to
show you the R version, unless offline.

import numpy as np
import pandas as pd

Your two forms of data will now be instantiated and then put into the
alternate formats to show some things.

>>> mydict = {'spam': 25, 'ham': 2, 'eggs': 7, 'cheddar': 1, 'brie': 14,
          'aardvark': 3, 'argument': 11, 'parrot': 16}

>>> df = [['spam', 'ham', 'eggs', 'cheddar', 'brie', 'aardvark', 'argument',
'parrot'],
      [25, 2, 7, 1, 14, 3, 11, 16]]

>>> mydict
{'spam': 25, 'ham': 2, 'eggs': 7, 'cheddar': 1, 'brie': 14, 'aardvark': 3,
'argument': 11, 'parrot': 16}
>>> df
[['spam', 'ham', 'eggs', 'cheddar', 'brie', 'aardvark', 'argument',
'parrot'], [25, 2, 7, 1, 14, 3, 11, 16]]

There are many ways to create a data.frame from lists, vectors,
dictionaries, ...

Your data is a bit skimpy for showing much but it can easily be imported:

>>> df_from_list = pd.DataFrame(data=df)
>>> df_from_list
      0    1     2        3     4         5         6       7
0  spam  ham  eggs  cheddar  brie  aardvark  argument  parrot
1    25    2     7        1    14         3        11      16

I would prefer to import this as columns, so I will first convert your
list of two lists to a vertical format more suitable for such tables:

>>> newdf = list(zip(df[0], df[1]))
>>> newdf
[('spam', 25), ('ham', 2), ('eggs', 7), ('cheddar', 1), ('brie', 14),
('aardvark', 3), ('argument', 11), ('parrot', 16)]

>>> vert_df = pd.DataFrame(newdf, columns=["FOOD", "AMOUNT"])
>>> vert_df
       FOOD  AMOUNT
0      spam      25
1       ham       2
2      eggs       7
3   cheddar       1
4      brie      14
5  aardvark       3
6  argument      11
7    parrot      16

It does not format well here, but you can see the two columns as well as
a sort of row index.

You can take subsets of rows:

>>> vert_df[2:4]
      FOOD  AMOUNT
2     eggs       7
3  cheddar       1

You can select by conditions:

>>> vert_df[vert_df['AMOUNT'] >= 7]
       FOOD  AMOUNT
0      spam      25
2      eggs       7
4      brie      14
6  argument      11
7    parrot      16

And so on. The point is that you can often read in data and manipulate
it. R has multiple sets of tools, including those in what they call the
tidyverse. In English: given such a data structure with any number of
rows and columns, you have names for the columns and optionally the rows.
The tools let you select any combination of rows and columns based on all
kinds of search and matching criteria. You can re-arrange them, add new
columns or create new ones from the data in existing ones, generate all
kinds of statistical info such as the standard deviation of each column,
or apply your own functions. All this can be done in a pipelined fashion.
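Pandas supports a similar pipelined style through method chaining. A small sketch (the derived column name is my own invention): select rows, create a new column from an existing one, then summarise, all in one chain:

```python
import pandas as pd

df = pd.DataFrame({"FOOD": ["spam", "ham", "eggs", "brie"],
                   "AMOUNT": [25, 2, 7, 14]})

# Filter rows, derive a new column, then reduce -- one pipeline.
result = (df[df["AMOUNT"] > 5]                       # keep AMOUNT > 5
          .assign(DOUBLED=lambda d: d["AMOUNT"] * 2) # derived column
          ["DOUBLED"]
          .sum())
print(result)  # (25 + 7 + 14) * 2 = 92
```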

What you often do is read a data.frame in from a comma-separated values
(CSV) file, or all kinds of data from other programs including Excel
spreadsheets, Stata, and so on (including the Feather format Python can
produce), and massage the data: removing rows with any NA (not available)
values, interpolating new values, splitting it into multiple data frames
as discussed, and so on. You can feed entire data frames, or selected
subsets, to functions to do many statistical analyses such as linear and
other forms of regression, and it really shines when you feed these data
structures to graphics engines like ggplot2, letting you make amazing
graphs. Like I said, R is designed with vectors and data.frames as
principal components.
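For flavor, here is roughly what that read/clean/split workflow looks like in pandas. This is a sketch only: the CSV text is a made-up stand-in for a real file, and the 75/25 split uses random sampling:

```python
import io
import pandas as pd

# A tiny CSV stand-in for a real file; note ham's missing (NA) value.
csv_text = "FOOD,AMOUNT\nspam,25\nham,\neggs,7\nbrie,14\n"
df = pd.read_csv(io.StringIO(csv_text))

# Drop rows with any NA, then split roughly 75/25 at random.
clean = df.dropna()
training = clean.sample(frac=0.75, random_state=0)
reserved = clean.drop(training.index)

print(len(df), len(clean))                 # 4 rows read, 3 kept
print(len(training), len(reserved))        # split adds back up to 3
```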

But once Python is augmented, it can do much of the same. I am not quite
sure how much is ported versus invented. Some data types like "formulas"
seem to be done differently. It will take me a while to study it all.

I can point to resources if anyone is interested, but again, this is a
Python forum. So it is of interest to me that it is possible to combine
bits and pieces of R and Python in the same programming environment. I
mean you can use one to do what it does best, have the data structures
silently be translated into something the other one understands, do some
more processing where you have software that shines, then switch back and
forth as needed. This kind of duality may mean it is not necessary, in
some cases, to keep changing one language to be able to do what the other
does. And, amusingly, much of the underlying functionality accessed is in
C or C++, with some data structures being translated to the compiled
C/C++ equivalents as you enter a function, then translated back at exit.

This is not very deep; I am just making a point since Alan asked. You can
find strengths and weaknesses in any language. I love how Python
consistently has everything be an object. R started off without object
orientation and has since grafted on at least a dozen variations, which
can be a tad annoying.



-----Original Message-----
From: Tutor <tutor-bounces+avigross=verizon.net at python.org> On Behalf Of
Steven D'Aprano
Sent: Friday, December 28, 2018 8:04 PM
To: tutor at python.org
Subject: Re: [Tutor] decomposing a problem

On Fri, Dec 28, 2018 at 03:34:19PM -0500, Avi Gross wrote:

[...]
> You replied to one of my points with this about a way to partition data:
> 
> ---
> The obvious solution:
> 
> keys = list(mydict.keys())
> random.shuffle(keys)
> index = len(keys)*3//4
> training_data = keys[:index]
> reserved = keys[index:]
> ---
> 
> (In the above, "---" is not python but a separator!)
> 
> That is indeed a very reasonable way to segment the data. But it sort 
> of makes my point. If the data is stored in a dictionary, the way to 
> access it ended up being to make a list and play with that. I would 
> still need to get the values one at a time from the dictionary such as 
> in the ways you also show and I omit.

Yes? How else do you expect to get the value given a key except by looking
it up?


> For me, it seems more natural in this case to simply have the data in 
> a data frame where I have lots of tools and methods available.


I'm not sure if your understanding of a data frame is the same as my
understanding. Are you talking about this?

http://www.r-tutor.com/r-introduction/data-frame

In other words, a two-dimensional array of some sort?

Okay, you have your data frame. Now what? How do you solve the problem
being asked? I'm not interested in vague handwaving that doesn't solve
anything. You specified data in a key:value store, let's say like this:


mydict = {'spam': 25, 'ham': 2, 'eggs': 7, 'cheddar': 1, 'brie': 14,
          'aardvark': 3, 'argument': 11, 'parrot': 16}

Here it is as a data frame:

df = [['spam', 'ham', 'eggs', 'cheddar', 'brie', 'aardvark', 'argument',
'parrot'],
      [25, 2, 7, 1, 14, 3, 11, 16]]

Now what? How do you randomly split that into randomly selected set of
training data and reserved data?

Feel free to give an answer in terms of R, provided you also give an
answer in terms of Python. Remember that unlike R, Python doesn't have a
standard data frame type, so you are responsible for building whatever
methods you need.




-- 
Steve
_______________________________________________
Tutor maillist  -  Tutor at python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


