[Tutor] Translating R Code to Python-- reading in csv files, writing out to csv files

Sat May 19 23:32:39 CEST 2012

Greetings Benjamin,

To begin: I do not know R.

 : I'm trying to improve my python by translating R code that I 
 : wrote into Python. 
 :
 : *All I am trying to do is take in a specific column in 
 : "uncurated" and write that whole column as output to "curated." 
 : It should be a pretty basic command, I'm just not clear on how to 
 : execute it.*

The hardest part about translation is learning how to think in a 
different language.  If you know any other human languages, you 
probably know that you can say things in some languages that do not 
translate particularly well (other than circumlocution) into another 
language.  Why am I starting with this?  I am starting here because 
you seem quite comfortable with thinking and operating in R, but you 
don't seem as comfortable yet with thinking and operating in Python.

Naturally, that's why you are asking the Tutor list about this, so 
welcome to the right place!  Let's see if we can get you some help.

 : As background, GSEXXXXX_full_pdata.csv has different patient 
 : information (such as unique patient ID's, whether the tissue used 
 : was tumor or normal, and other things. I'll just use the first 
 : two characteristics for now). Template.csv is a template we built 
 : that allows us to take different datasets and standardize them 
 : for meta-analysis.  So for example, "curated$alt_sample_name" 
 : refers to the unique patient ID, and "curated$sample_type" refers 
 : to the type of tissue used. 

I have fabricated some data after your description that looks like 
this:

  patientID,title,sample_type
  V6IF0OqVu,0.5788,70
  GXj51ljB2,0.3449,88

You doubtless have more columns, and the data here are probably 
nothing like yours, but consider it useful for illustrative purposes 
only.  (Illustrating porpoises!  How did they get here?  Next thing 
you know we will have illuminating egrets and animating 
dromedaries!)

 : I've been reading about the python csv module and realized it was 
 : best to get some expert input to clarify some confusion on my 
 : part. 

The csv module is very useful and quite powerful for reading data in 
different ways and iterating over data sets.  Supposing you know the 
index of the column of interest to you...well this is quite trivial:

  import csv
  def main(f,field):
      # each row is a list of strings, one per column
      for row in csv.reader(f):
          print(row[0], row[field])

  # -- lists/tuples are zero-based [0,1,2], so 2 is the third column
  #
  with open('GSEXXXXX_full_pdata.csv', newline='') as f:
      main(f, 2)

OK, but if your data files have differing numbers or orderings of 
columns, then this can become a bit fragile.  So maybe you would 
want to learn how to use the csv.DictReader, which will give you the 
same thing but uses the first (header) line to name the columns, so 
then you could do something more like this:

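  import csv
  def main(f,id_field,field):
      # each row is a dict keyed by the names in the header line
      for row in csv.DictReader(f):
          print(row[id_field], row[field])

  with open('GSEXXXXX_full_pdata.csv', newline='') as f:
      main(f, 'patientID', 'sample_type')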

Would you like more detail on this?  Well, have a look at this nice 
little summary:

  http://www.doughellmann.com/PyMOTW/csv/

Now, that really is just giving you a glimpse of the csv module.  
This is not really your question.  Your question was more along the 
lines of 'How do I, in Python, accomplish this task that is quite 
simple in R?' 

You may find that list-comprehensions, generators and iterators are 
all helpful in mangling the data according to your nefarious will 
once you have used the csv module to load the data into a data 
structure.
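
Here is a rough sketch of that idea, written for Python 3.  It uses 
the fabricated data from above; the function name copy_column and the 
choice of output header are made up for illustration, so adjust to 
taste.  It loads the rows with csv.DictReader, pulls out one column 
with a list comprehension, and writes it back out with csv.writer:

```python
import csv
import io

def copy_column(infile, outfile, field, out_name):
    """Copy one named column from infile to outfile under a new header."""
    rows = list(csv.DictReader(infile))        # load everything: a list of dicts
    values = [row[field] for row in rows]      # list comprehension: one column only
    writer = csv.writer(outfile, lineterminator='\n')
    writer.writerow([out_name])                # header line for the new file
    writer.writerows([v] for v in values)      # generator: one single-cell row each
    return values

# Fabricated data standing in for GSEXXXXX_full_pdata.csv
uncurated = io.StringIO(
    "patientID,title,sample_type\n"
    "V6IF0OqVu,0.5788,70\n"
    "GXj51ljB2,0.3449,88\n"
)
curated = io.StringIO()
values = copy_column(uncurated, curated, 'patientID', 'alt_sample_name')
print(curated.getvalue())
```

With real files you would pass in open('...', newline='') for reading 
and open('...', 'w', newline='') for writing instead of the StringIO 
objects, which are only there to keep the sketch self-contained.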

In point of fact, though, Python does not have this particular 
feature that you are seeking, at least not in the core libraries.

The lack of this capability has bothered a few people over the 
years, so there are a few different types of solutions.  You have 
already heard a reference to RPy (about which I know nothing):

  http://rpy.sourceforge.net/

There are, however, a few other tools that you may find quite 
useful.  One chap wanted access to some features of R that he used 
all the time along with many of the other convenient features of 
Python, so he decided to implement dataframes (an R concept?) in 
Python.  This idea was present at the genesis of the pandas library.

  http://pandas.pydata.org/

So, how would you do this with pandas?  Well, you could:

  import pandas
  def main(f,field):
      uncurated = pandas.read_csv(f)
      # select a single column by its header name
      curated = uncurated[field]
      print(curated)

  with open('GSEXXXXX_full_pdata.csv') as f:
      main(f, 'sample_type')

Note that pandas is geared to allow you to access your data by the 
'handles', the unique identifier for the row and the column name. 
The code above prints just the single column you asked for, one 
value per row alongside its row identifier.  You may find that 
pandas affords you access to tools with which you are already 
intellectually familiar.
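
Since you wanted to write the column out to "curated" and not merely 
print it, here is a minimal sketch of that step, assuming Python 3 
and a reasonably recent pandas.  The function name curate and the 
sample data are mine, invented for illustration; the column names 
follow your description.

```python
import io
import pandas

def curate(infile, outfile, field, out_name):
    """Write one column of infile to outfile as CSV under a new header."""
    uncurated = pandas.read_csv(infile)
    # Double brackets keep a one-column DataFrame instead of a Series
    curated = uncurated[[field]].rename(columns={field: out_name})
    curated.to_csv(outfile, index=False)   # index=False omits the row labels
    return curated

# Fabricated data standing in for GSEXXXXX_full_pdata.csv
uncurated_csv = io.StringIO(
    "patientID,title,sample_type\n"
    "V6IF0OqVu,0.5788,70\n"
    "GXj51ljB2,0.3449,88\n"
)
out = io.StringIO()
result = curate(uncurated_csv, out, 'patientID', 'alt_sample_name')
print(out.getvalue())
```

Both read_csv and to_csv also accept plain file names, so with real 
files you can pass the paths directly in place of the StringIO 
objects.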

Good luck,

-Martin

P.S. While I was writing this, you sent in some sample data that 
   looked tab-separated (well, anyway, not comma-separated).  The 
   csv readers and writers accept a delimiter='\t' argument, and
   pandas.read_csv accepts sep='\t'.  So, you could do:
     csv.reader(f,delimiter='\t')
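
   A slightly fuller sketch of the same thing, for Python 3, using 
   made-up tab-separated data (I have not seen your real file, so 
   the contents here are only a guess at its shape):

```python
import csv
import io

# Fabricated tab-separated data; a real file would be opened with
# open('...', newline='') instead of wrapped in StringIO
tab_data = io.StringIO(
    "patientID\ttitle\tsample_type\n"
    "V6IF0OqVu\t0.5788\t70\n"
)

# delimiter='\t' tells the reader to split fields on tabs, not commas
rows = list(csv.DictReader(tab_data, delimiter='\t'))
print(rows[0]['patientID'], rows[0]['sample_type'])
```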

-- 
Martin A. Brown
http://linux-ip.net/