[Tutor] Pythonic way

Avi Gross avigross at verizon.net
Tue Nov 20 13:08:35 EST 2018


This is not a question or reply. Nor is it short. If not interested, feel free to delete.

 

It is an observation based on recent experiences.

 

We have had quite a few messages pointing out how people approach solving a problem using subconscious paradigms inherited from their past. This includes people who have never programmed and are thinking of how they might do the task manually, as well as people who are proficient in one or more other computer languages, whose first attempt to visualize the solution may lead along paths that are doable in Python but not optimal, or not even advisable.

 

I recently had to throw together a routine that would extract info from multiple SAS data files and join them together on one key into a large DataFrame (or data.frame, or whatever name your tool gives such a tabular object). Then I needed to write the result out to disk as either a CSV or an XLSX file for future use.

 

Since I have studied and used (not to mention abused) many programming languages, my first thought was to do this in R. It has lots of the tools needed for such things, including packages (sort of like modules you can import, but not exactly), and I have written many data/graphics programs in it. After some thought, I then redid it in Python.

 

The pseudocode outline is:

 

*	Read in all the files into a set of data.frame objects.
*	Trim back the variables/columns of some of them as many are not needed.
*	Join them together on a common index using a full or outer join.
*	Save the result on disk as a Comma Separated Values file.
*	Save the result on disk as a named tab (sheet) in a new-style Excel (XLSX) file.
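
Jumping ahead to the Python version for a moment, the first two steps of that outline map onto pandas roughly like this. This is a minimal sketch with invented file names and keep-lists, assuming the inputs are sas7bdat files, which pd.read_sas reads directly (the join and save steps appear further down):

import pandas as pd

# Hypothetical inputs: each file name mapped to the columns worth keeping,
# always including the common key needed for the join.
wanted = {
    'first.sas7bdat':  ['ID', 'ALPHA', 'BETA'],
    'second.sas7bdat': ['ID', 'GAMMA'],
}

# Steps 1 and 2: read each file, then trim it to its keep-list.
dflist = [pd.read_sas(name)[cols] for name, cols in wanted.items()]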

 

I determined some of what I might use, such as the needed built-in commands, packages and functions for the early parts, but ran into an annoyance: some of the files contained duplicate column names. Luckily, the R function reduce (not the same as map/reduce) is like many things in R: hand it a list of items and it makes the whole thing work. Also, by default, the join renames duplicates, so if you have ALPHA in multiple places, you end up with names like ALPHA.x and ALPHA.x.x and other variations.

 

df_joined <- reduce(df_list, full_join, by = "COMMON")
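# (reduce() here is purrr::reduce and full_join() comes from dplyr; the
# by = "COMMON" argument is passed through to full_join at each step.)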

 

Mind you, when I stripped away many of the columns not needed in some of the files, there were fewer duplicates and a much smaller output file.

 

But I ran into a wall later. Saving into a CSV is trivial. There are multiple packages meant for saving into an XLSX file, but they all failed for me. One wanted something in Java, another may want Perl, and some may want packages I do not have installed. So, rather than bash my head against the wall, I pivoted and used the best XLSX maker there is: I opened the CSV file in Excel manually and did a SAVE AS …

 

Then I went to plan C (no, not the language C or its many extensions like C++). As I am still learning Python and have not used it much, I decided, as an exercise, to learn how to do this in Python using modules like numpy and pandas that I have not had much need for, as well as additional tools for reading and writing files in other formats.

 

My first attempts gradually worked, after lots of mistakes and visits to the manual pages. The code followed an eclectic set of paradigms, but it worked. Not immediately, though, as I ran into a problem: the pandas version of a join did not tolerate duplicate column names when used on a list of DataFrames. I could get it to rename the left or right columns (by adding a suffix) only when joining exactly two DataFrames. So I needed to take the first df and do a df.join(second, …), then take that result and join the third, and so on. I also needed to keep telling it to set the index to the common key for each and every df, including the newly joined result. And, due to the sizes involved, I chose to keep deleting DataFrames that were no longer in use but would otherwise not be garbage collected.

 

I then looked again at how to tighten it up in a more pythonic way. In English (my sixth language, since we are talking about languages 😉): I had done some things linearly, and I shifted them to a list-based approach. I used a list of file names and a list of the DataFrames made from each file after removing unwanted columns. (NOTE: I say “column”, but depending on language and context I mean variable, field, axis or any of the many other names for a group of related information that crosses the rows, or instances, of a tabular structure.)

 

So I was able to do my multi-step join more like this:

 

# dflist holds one DataFrame per input file, each sharing the common 'ID' column.
current = dflist[0].set_index('ID')

for suffix, df in enumerate(dflist[1:], start=1):
    # Join on the common key; rsuffix renames duplicate columns arriving
    # from the right side, e.g. a second ALPHA becomes ALPHA_1.
    # Note that set_index() returns a new frame rather than modifying in place.
    current = current.join(df.set_index('ID'), how='outer',
                           rsuffix='_' + str(suffix))

 

In this formulation, the intermediate DataFrame objects previously held in current are silently garbage collected, since rebinding the name leaves nothing pointing to them. Did I mention these were huge files?

 

The old code was much longer and error prone, as I had df1, df2, … df8 as well as other intermediates, and it was easy to copy and paste a line and then fail to edit the copy properly.

 

On to the main point. Saving as a CSV was trivial. Saving as an XLSX took some work BECAUSE, although I had pandas, I was apparently missing some components it needed. I had to figure out what was missing, get it installed, and finally got it working nicely.
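
For the record, a minimal sketch of that save step, assuming the joined result is in current as above and using made-up output names. pandas writes CSV itself but hands XLSX writing off to an engine package such as openpyxl or xlsxwriter, which is presumably the kind of component I was missing:

# to_csv needs nothing extra; to_excel delegates to an installed engine
# package (openpyxl or xlsxwriter) and names the tab via sheet_name.
current.to_csv('joined.csv')
current.to_excel('joined.xlsx', sheet_name='joined')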

 

In this case, I am sure I could have figured out how to make my R environment work on my Windows 10 machine, or installed the other software needed, or moved my development into Cygwin, or even used Linux.

 

Now for the main point. As hinted above, the functionality of a particular programming language like R or Python sometimes already requires external resources that can include parts made in other languages. Python makes extensive use of internals written in C for speed, and the interpreter itself is written in C or C++. Similarly, R uses C and C++. So the next step can be to integrate the languages. They have strengths and weaknesses, and I know both are used heavily in Machine Learning, which is part of my current focus. Some things are best done when your language matches the method/algorithm you use. R has half a dozen object-oriented variations, and generally they are not as useful as the elegant and consistent OO model in Python. But it has strengths that let me make weird graphics in at least three different major philosophies (base, lattice and ggplot), and many of its nice features flow from the underlying philosophy that everything is a vector and most operations are vectorized or can be. Handy. It can lead to modes of thinking about a problem quite different from the pythonic way.

 

So, I can now load an R package (reticulate is one such) that lets me shift back and forth in the same session to a now-embedded Python interpreter. Many “objects” can be created and processed in one language, then passed through the veil to functionality in the other, and back and forth. You can load modules within the Python component and load packages in the R component. Within reason, you use the best tools for the job. If part of the job is best done with sets or dictionaries or generators, do it on the Python side. Want to use ggplot2 for graphics? Do it on the R side.

 

In reality, there are many more programming languages (especially dormant ones) than are really needed. But they evolve, and some are designed more for certain tasks than others. So, if you want to use Python, why not see if you can use it the way it is loosely intended? If all you are doing is rewriting your code to fit the mold of another language, why not use that language, if it is still available?

 

Do I have a favorite language? Not really. I note that in attempts to improve Python (and other languages too) over the years, they keep adding features, often in ways that change things enough that there is no longer as clear a philosophy. I can see heavy borrowing from many horse breeds producing a camel. So there isn’t really ONE pythonic way for many things. We have two completely separate ways to format strings that end up with fairly similar functionality. Actually, there is an implicit third way 😊
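
By my reading, those ways are printf-style % formatting and str.format, with f-strings as the implicit third; a small illustration of the overlap:

name, score = 'ALPHA', 0.5
print('%s = %.2f' % (name, score))         # old printf-style formatting
print('{} = {:.2f}'.format(name, score))   # str.format (PEP 3101)
print(f'{name} = {score:.2f}')             # f-strings (PEP 498, Python 3.6+)

All three print ALPHA = 0.50.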

 

So I think there are multiple overlapping sets of what it means to be pythonic. If you come from a procedural background, you can do much without creating objects or using functional programming skills. If you come from an OO background, you can have fun making endless classes and subclasses to do just about everything, including procedural things the hard way by creating one master object that controls lots of slave objects. If you like functional programming, with factories that churn out functions that retain their enclosing variables, again, have fun. There are other supported paradigms too, including lots of miniature sub-languages, regular expressions being one example and the print-formatting methods another. To be fluent in Python, though, you need to be able to use other people’s existing code and perhaps be able to modify or emulate it. That effectively means being open to multiple ways, so in a sense, being pythonic includes being flexible, to a point.
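
To make the sub-language point concrete, here is a toy example (pattern and values invented) where two such embedded languages, regular expressions and the format-spec mini-language, sit inside ordinary Python:

import re

# The regex names its captures; the format spec after the colon controls
# the field width of the printed number.
m = re.search(r'(?P<key>\w+)=(?P<val>\d+)', 'rate=42')
print(f"{m['key']}: {int(m['val']):>8d}")   # -> rate:       42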

 

Good place to stop and resume my previously scheduled life.

 


