[Tutor] Translating R Code to Python-- reading in csv files, writing out to csv files
Martin A. Brown
martin at linux-ip.net
Sat May 19 23:32:39 CEST 2012
Greetings Benjamin,
To begin: I do not know R.
: I'm trying to improve my python by translating R code that I
: wrote into Python.
:
: *All I am trying to do is take in a specific column in
: "uncurated" and write that whole column as output to "curated."
: It should be a pretty basic command, I'm just not clear on how to
: execute it.*
The hardest part about translation is learning how to think in a
different language. If you know any other human languages, you
probably know that you can say things in some languages that do not
translate particularly well (other than circumlocution) into another
language. Why am I starting with this? I am starting here because
you seem quite comfortable with thinking and operating in R, but you
don't seem as comfortable yet with thinking and operating in Python.
Naturally, that's why you are asking the Tutor list about this, so
welcome to the right place! Let's see if we can get you some help.
: As background, GSEXXXXX_full_pdata.csv has different patient
: information (such as unique patient ID's, whether the tissue used
: was tumor or normal, and other things. I'll just use the first
: two characteristics for now). Template.csv is a template we built
: that allows us to take different datasets and standardize them
: for meta-analysis. So for example, "curated$alt_sample_name"
: refers to the unique patient ID, and "curated$sample_type" refers
: to the type of tissue used.
I have fabricated some data after your description that looks like
this:
patientID,title,sample_type
V6IF0OqVu,0.5788,70
GXj51ljB2,0.3449,88
You doubtless have more columns and the data here are probably
nothing like yours, but consider it useful for illustrative purposes
only. (Illustrating porpoises! How did they get here? Next thing
you know we will have illuminating egrets and animating
dromedaries!)
: I've been reading about the python csv module and realized it was
: best to get some expert input to clarify some confusion on my
: part.
The csv module is very useful and quite powerful for reading data in
different ways and iterating over data sets. Supposing you know the
index of the column of interest to you...well this is quite trivial:
import csv

def main(f, field):
    for row in csv.reader(f):
        # print the first column and the requested column of every row
        print(row[0], row[field])

# -- lists/tuples are zero-based [0,1,2], so 2 is the third column
main(open('GSEXXXXX_full_pdata.csv', newline=''), 2)
OK, but if your data files differ in the number or ordering of
columns, then this can become a bit fragile. So maybe you would
want to learn how to use csv.DictReader, which will give you the
same thing but uses the first (header) line to name the columns, so
then you could do something more like this:
import csv

def main(f, id_field, field):
    # DictReader uses the header line to key each row by column name
    for row in csv.DictReader(f):
        print(row[id_field], row[field])

main(open('GSEXXXXX_full_pdata.csv', newline=''), 'patientID', 'sample_type')
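Since your goal was to write the column back out, here is a minimal sketch of the other half, pairing csv.DictReader with csv.DictWriter. The function name copy_columns is my own invention, and I am using io.StringIO with the fabricated data from above in place of your real files, so treat it as an illustration rather than a finished tool:

```python
import csv
import io

def copy_columns(src, dst, fields):
    """Copy only the named columns from src to dst (both file-like objects)."""
    reader = csv.DictReader(src)
    # extrasaction='ignore' silently drops the columns we did not ask for
    writer = csv.DictWriter(dst, fieldnames=fields, extrasaction='ignore',
                            lineterminator='\n')
    writer.writeheader()
    for row in reader:
        writer.writerow(row)

# Fabricated sample data, held in memory instead of a file on disk
uncurated = io.StringIO(
    "patientID,title,sample_type\n"
    "V6IF0OqVu,0.5788,70\n"
    "GXj51ljB2,0.3449,88\n"
)
curated = io.StringIO()
copy_columns(uncurated, curated, ['patientID', 'sample_type'])
print(curated.getvalue())
```

With real files you would pass open file objects instead of the StringIO buffers.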
Would you like more detail on this? Well, have a look at this nice
little summary:
http://www.doughellmann.com/PyMOTW/csv/
Now, that really is just giving you a glimpse of the csv module.
This is not really your question. Your question was more along the
lines of 'How do I, in Python, accomplish this task that is quite
simple in R?'
You may find that list-comprehensions, generators and iterators are
all helpful in mangling the data according to your nefarious will
once you have used the csv module to load the data into a data
structure.
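To give a flavor of what I mean, here is a small sketch using the fabricated data from above (again held in an io.StringIO buffer rather than a real file). The 0.5 threshold on the 'title' column is purely made up for illustration:

```python
import csv
import io

data = io.StringIO(
    "patientID,title,sample_type\n"
    "V6IF0OqVu,0.5788,70\n"
    "GXj51ljB2,0.3449,88\n"
)

# Load every row into a list of dictionaries keyed by column name
rows = list(csv.DictReader(data))

# List comprehension: pull one whole column out as a list of strings
sample_types = [row['sample_type'] for row in rows]

# Generator expression: lazily select IDs whose 'title' value exceeds 0.5
# (csv hands back strings, so convert before comparing)
high_ids = list(row['patientID'] for row in rows if float(row['title']) > 0.5)

print(sample_types)
print(high_ids)
```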
In point of fact, though, Python does not have this particular
feature that you are seeking... not in the core libraries, anyway.
The lack of this capability has bothered a few people over the
years, so there are a few different types of solutions. You have
already heard a reference to RPy (about which I know nothing):
http://rpy.sourceforge.net/
There are, however, a few other tools that you may find quite
useful. One chap wanted access to some features of R that he used
all the time along with many of the other convenient features of
Python, so he decided to implement dataframes (an R concept?) in
Python. This idea was present at the genesis of the pandas library.
http://pandas.pydata.org/
So, how would you do this with pandas? Well, you could:
import pandas

def main(f, field):
    uncurated = pandas.read_csv(f)
    # select a single column by its name
    curated = uncurated[field]
    print(curated)

main(open('GSEXXXXX_full_pdata.csv'), 'sample_type')
Note that pandas is geared to let you access your data by its
'handles': the unique identifier for the row (the index) and the
column name. Selecting uncurated[field] returns just that single
column, printed alongside the row index. You may find that pandas
affords you access to tools with which you are already
intellectually familiar.
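If you go the pandas route, writing the column back out might look something like the sketch below. It assumes pandas is installed, and it uses in-memory buffers with my fabricated data where you would use your real "uncurated" and "curated" files:

```python
import io
import pandas

uncurated = pandas.read_csv(io.StringIO(
    "patientID,title,sample_type\n"
    "V6IF0OqVu,0.5788,70\n"
    "GXj51ljB2,0.3449,88\n"
))

# Double brackets select a one-column DataFrame rather than a Series
curated = uncurated[['sample_type']]

# index=False leaves the row index out of the written file
out = io.StringIO()
curated.to_csv(out, index=False)
print(out.getvalue())
```

To write to disk instead, you would pass a filename to to_csv in place of the buffer.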
Good luck,
-Martin
P.S. While I was writing this, you sent in some sample data that
looked tab-separated (well, anyway, not comma-separated). The
csv readers and writers accept a delimiter='\t' option, and
pandas.read_csv accepts sep='\t'. So, you could do:
csv.reader(f,delimiter='\t')
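A slightly fuller sketch of the same idea, using a tab-separated version of my fabricated data in an in-memory buffer (your real file would go where the StringIO object is):

```python
import csv
import io

tsv = io.StringIO(
    "patientID\ttitle\tsample_type\n"
    "V6IF0OqVu\t0.5788\t70\n"
    "GXj51ljB2\t0.3449\t88\n"
)

# delimiter='\t' tells the reader to split fields on tabs instead of commas
rows = list(csv.DictReader(tsv, delimiter='\t'))

for row in rows:
    print(row['patientID'], row['sample_type'])
```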
--
Martin A. Brown
http://linux-ip.net/