data management with python from perl
Christopher Koppler
klapotec at chello.at
Wed Oct 8 09:51:05 EDT 2003
bmoretti at chariot.net.au (ben moretti) wrote:
> i'm learning python, and one area i'd use it for is data management in
> scientific computing. in the case i've tried i want to reformat a data
> file from a normalised list to a matrix with some sorted columns. to
> do this at the moment i am using perl, which is very easy to do, and i
> want to see if python is as easy.
Not being too familiar with Perl (or scientific computing), I'm not
sure if I understood everything correctly...
> 1.00 1.00 1.00 "MO" 906.00 "genus species 1" 1.00
> 1.00 1.00 1.00 "MO" 906.00 "genus species 2" 1.00
> 1.00 1.00 1.00 "MO" 906.00 "genus species 3" 1.00
> 1.00 1.00 1.00 "MO" 906.00 "genus species 4" 1.00
I _think_ you want your data as a nested dictionary like so:
{1: {1: {1: {"MO": {906: {"genus species 1": 1,
                          "genus species 2": 1,
                          "genus species 3": 1,
                          "genus species 4": 1} }}}}}
> so, to do this in perl - and i won't bore you with the whole script -
> i read the file, split it into tokens
I hope I will NOT bore you with a whole script, but I've expanded your
data a bit to have a somewhat more complicated/structured data file to
work with (not shown here, this's more than long enough as it is); so
I'll first read it in and split it up:
###
import csv
f = open(r"i:\python\nestedtest.txt", "r") # my testdata
csvreader = csv.reader(f, delimiter=' ', quotechar='"')
###
From your output I gather that maybe you want the numbers as numbers,
and not as strings, so I'll convert the data while populating an
intermediate list:
###
def parselist(lst):
    """convert the list's values to floats or integers where
    appropriate"""
    parsed = []
    for itm in lst:
        try:
            f = float(itm)
            i = int(f)
            if i == f:
                parsed.append(i)
            else:
                parsed.append(f)
        except ValueError:
            parsed.append(itm)
    return parsed

datalist = []
for line in csvreader:
    datalist.append(parselist(line))
f.close() # don't need that anymore
###
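As a quick check of the two steps above, here is a minimal sketch that runs one line of the sample data through csv.reader and the same conversion rule. It is written for Python 3 and reads from an in-memory string rather than the test file; `convert` is a hypothetical stand-in for the per-item logic of parselist, not a function from the script.

```python
import csv
import io

def convert(item):
    """Hypothetical helper: the parselist rule applied to a single token."""
    try:
        number = float(item)
    except ValueError:
        return item              # not numeric, keep the string
    if number.is_integer():
        return int(number)       # e.g. "906.00" becomes 906
    return number

# One line of the sample data, read from memory instead of from a file.
sample = io.StringIO('1.00 1.00 1.00 "MO" 906.00 "genus species 1" 1.00\n')
rows = [[convert(item) for item in line]
        for line in csv.reader(sample, delimiter=' ', quotechar='"')]
print(rows)
```

The quoted fields come back without their quotes, and "genus species 1" stays a single token despite the embedded spaces.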
> and then populate a hash of
> hashes, the syntax of which is
>
> $HoH{$tokens[0]}{$tokens[1]}{$tokens[2]}{$tokens[3]}{$tokens[4]}{$tokens[5]}
> = $tokens[6]
Now, if that does what I think it does (create a nested hash), then
hats off to Perl! I haven't found anything as concise built into
Python (but then I'm not a guru, maybe someone else knows a better
way?), so I rolled my own:
###
def nestdict(lst):
    """create a recursively nested dictionary from a _flat_ list"""
    dct = {}
    if len(lst) > 2:
        dct[lst[0]] = nestdict(lst[1:])
    elif len(lst) == 2:
        dct[lst[0]] = lst[1]
    return dct
###
which is good for ONE line of input; since I have a list of those, I
want to build up the dictionary line by line, for which I need another
function:
###
def nestextend(dct, upd):
    """recursively extend/update a nested dictionary with another one"""
    try:
        items = upd.items()
        for key, val in items:
            if key not in dct:
                dct[key] = val
            else:
                nestextend(dct[key], upd[key])
    except AttributeError:
        dct.update(upd)

datadict = {}
for lst in datalist:
    nestextend(datadict, nestdict(lst))
###
datadict now holds all the data from the testfile in a nested
dictionary with the various locations and species values as the keys
of the hash, which is what (I hope) you wanted.
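For comparison, the same kind of structure can probably be built in a single pass with dict.setdefault, without first creating a one-line nested dictionary and merging it. This is only a sketch of that alternative: `insert_row` is a hypothetical name, and the rows are made-up sample data in the shape of the lines above.

```python
def insert_row(tree, row):
    """Store row[-1] under the key path row[:-1], creating levels as needed."""
    node = tree
    for key in row[:-2]:
        node = node.setdefault(key, {})   # descend, creating missing dicts
    node[row[-2]] = row[-1]               # last key maps to the value

# Made-up rows in the same shape as the parsed sample lines.
rows = [
    [1, 1, 1, "MO", 906, "genus species 1", 1],
    [1, 1, 1, "MO", 906, "genus species 2", 1],
    [1, 1, 2, "MO", 906, "genus species 1", 3],
]
tree = {}
for row in rows:
    insert_row(tree, row)
print(tree)
```

With this approach, a row that repeats a complete key path simply overwrites the earlier value.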
> and the abundance is the $tokens[6] value. this now gives me a
> multidimensional data structure
Reading that I'm not sure I've understood anything - shouldn't you
want to use a multidimensional array for that? Anyone familiar with
Python's scientific/number crunching/array libraries should be able to
clear that up...
> that i can use to loop over the keys and sort them by each as i go,
> then to write out the data into a matrix as above.
I'm not sure how you arrive at your matrix output, but looping over
the dictionary shouldn't be a problem now. However, since you also
want to sort the data (by key), and dictionaries notoriously don't
support that, I've written another function:
###
def nestsort(dct):
    """convert a nested dictionary to a nested (key, value) list,
    recursively sorting it by key"""
    lst = []
    try:
        items = sorted(dct.items())
        for key, value in items:
            lst.append([key, nestsort(dct[key])])
        return lst
    except AttributeError:
        return dct

sorteddata = nestsort(datadict)
###
So now the data from the beginning looks like:
[[1, [[1, [[1, [["MO", [[906, [["genus species 1", 1],
                               ["genus species 2", 1],
                               ["genus species 3", 1],
                               ["genus species 4", 1]]]]]]]]]]]]
which you probably could have had cheaper...
Now you can do something like:
###
for region, rdata in sorteddata:
    print("Region %s" % region)
    for location, ldata in rdata:
        print(" " * 2 + "Location %s" % location)
        for site, sitedata in ldata:
            print(" " * 4 + "Site %s" % site)
            for stand, stdata in sitedata:
                print(" " * 6 + "Stand %s" % stand)
                for substrate, subdata in stdata:
                    print(" " * 8 + "Substrate %s" % substrate)
                    for genus, abundance in subdata:
                        print(" " * 10 + "Genus %s Abundance %s"
                              % (genus, abundance))
###
to test my script and your (real) data.
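The six nested loops above could also be collapsed into one recursive walk over the dictionary itself, which skips the intermediate sorted list. The following is a sketch of that idea, not part of the script: `walk_sorted` is a hypothetical helper, and the small dictionary is sample data in the shape shown earlier.

```python
def walk_sorted(dct, prefix=()):
    """Yield one flat tuple (key, ..., key, value) per leaf, in key order."""
    for key in sorted(dct):
        value = dct[key]
        if isinstance(value, dict):
            # Descend one level, remembering the keys seen so far.
            for row in walk_sorted(value, prefix + (key,)):
                yield row
        else:
            yield prefix + (key, value)

# Sample data; the species are deliberately listed out of order.
data = {1: {1: {1: {"MO": {906: {"genus species 2": 1,
                                 "genus species 1": 1}}}}}}
for row in walk_sorted(data):
    print(row)
```

Each yielded tuple has the same seven fields as a line of the input file, so it may be a convenient starting point for writing out rows of a matrix.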
There's next to no error-checking and it sure'd be more
pythonic/beautiful/reusable if I'd subclass'd dict, but it works --
for my data at least.
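As a rough idea of what the dict subclass might look like, here is a minimal sketch. It assumes a Python version in which dict subclasses support the `__missing__` hook (2.5 or later), and `NestedDict` is a hypothetical name. With it, the assignment can be written much like the Perl one-liner quoted above.

```python
class NestedDict(dict):
    """Hypothetical dict subclass that creates missing levels on first access."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when key is absent: store and return
        # a new, empty nested level of the same type.
        child = self[key] = type(self)()
        return child

tokens = [1, 1, 1, "MO", 906, "genus species 1", 1]
hoh = NestedDict()
hoh[tokens[0]][tokens[1]][tokens[2]][tokens[3]][tokens[4]][tokens[5]] = tokens[6]
print(hoh)
```

One caveat of this design: merely reading a key that does not exist also creates an empty level for it, so membership should be tested with `in` rather than by indexing.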
> ok. so how do i do this in python? i've tried the "perlish" way but
Once more, it seems that "the perlish way" <> "the python way".
> didn't get very far, however i know it must be able to be done!
I don't think there's much of anything either language can do that the
other can't, but of course some things are harder than others...
> if you want to respond to this, try benmoretti at yahoo dot com dot au
> as i get too much spam otherwise
<posted to the NG and forwarded to you>
--
Christopher