data management with python from perl

Wed Oct 8 09:51:05 EDT 2003

bmoretti at chariot.net.au (ben moretti) wrote:

> i'm learning python, and one area i'd use it for is data management in
> scientific computing. in the case i've tried i want to reformat a data
> file from a normalised list to a matrix with some sorted columns. to
> do this at the moment i am using perl, which is very easy to do, and i
> want to see if python is as easy.

Not being too familiar with Perl (or scientific computing), I'm not
sure if I understood everything correctly...

> 1.00	1.00	1.00	"MO"	906.00	"genus species 1"	1.00
> 1.00	1.00	1.00	"MO"	906.00	"genus species 2"	1.00
> 1.00	1.00	1.00	"MO"	906.00	"genus species 3"	1.00
> 1.00	1.00	1.00	"MO"	906.00	"genus species 4"	1.00

I _think_ you want your data as a nested dictionary like so:
{1: {1: {1: {"MO": {906: {"genus species 1": 1,
                          "genus species 2": 1,
                          "genus species 3": 1,
                          "genus species 4": 1} }}}}}

> so, to do this in perl - and i won't bore you with the whole script -
> i read the file, split it into tokens

I hope I will NOT bore you with a whole script, but I've expanded your
data a bit to have a somewhat more complicated/structured data file to
work with (not shown here; this post is more than long enough as it
is). So I'll first read it in and split it up:

###

import csv

f = open(r"i:\python\nestedtest.txt", "r") # my testdata
csvreader = csv.reader(f, delimiter=' ', quotechar='"')

###
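
The splitting step can be tried without a file on disk. The sketch below
(current Python 3 syntax) feeds two lines of the quoted sample to
csv.reader from an in-memory buffer. It assumes the fields are
tab-separated, as they appear in the quoted sample, whereas the reader
above uses a space; the name `sample` is made up for illustration.

```python
import csv
import io

# Two lines of the quoted sample data; tab-separated fields are an assumption.
sample = (
    '1.00\t1.00\t1.00\t"MO"\t906.00\t"genus species 1"\t1.00\n'
    '1.00\t1.00\t1.00\t"MO"\t906.00\t"genus species 2"\t1.00\n'
)

# csv.reader accepts any iterable of lines, so a StringIO works like a file.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quotechar='"')
rows = list(reader)

for row in rows:
    print(row)
```

The quote characters are stripped by the reader, and every field still
arrives as a string, which is why a conversion step follows.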

From your output I gather that maybe you want the numbers as numbers,
and not as strings, so I'll convert the data while populating an
intermediate list:

###

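def parselist(lst):
  """convert the list's values to floats or integers where appropriate"""
  parsed = []
  for itm in lst:
    try:
      f = float(itm)
      i = int(f)  # raises for "nan" (ValueError) and "inf" (OverflowError)
      if i == f:
        parsed.append(i)  # whole number, e.g. "906.00" -> 906
      else:
        parsed.append(f)
    except (ValueError, OverflowError):
      parsed.append(itm)  # not a number: keep the string
  return parsed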

datalist = []
for line in csvreader:
  datalist.append(parselist(line))
f.close() # don't need that anymore

###

> and then populate a hash of
> hashes, the syntax of which is
> 
> $HoH{$tokens[0]}{$tokens[1]}{$tokens[2]}{$tokens[3]}{$tokens[4]}{$tokens[5]}
> = $tokens[6]

Now, if that does what I think it does (create a nested hash), then
hats off to Perl! I haven't found anything as concise built into
Python (but then I'm not a guru, maybe someone else knows a better
way?), so I rolled my own:

###

def nestdict(lst):
  """create a recursively nested dictionary from a _flat_ list"""
  dct = {}
  if len(lst) > 2:
    dct[lst[0]] = nestdict(lst[1:])
  elif len(lst) == 2:
    dct[lst[0]] = lst[1]
  return dct

###

which is good for ONE line of input; since I have a list of those, I
want to build up the dictionary line by line, for which I need another
function:

###

def nestextend(dct, upd):
  """recursively extend/update a nested dictionary with another one"""
  try:
    items = upd.items()
    for key, val in items:
      if key not in dct:
        dct[key] = val
      else:
        nestextend(dct[key], upd[key])
  except AttributeError:
    dct.update(upd)

datadict = {}
for lst in datalist:
  nestextend(datadict, nestdict(lst))

###

datadict now holds all the data from the testfile in a nested
dictionary with the various locations and species values as the keys
of the hash, which is what (I hope) you wanted.
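
For comparison with the Perl one-liner: Python releases newer than this
post (2.5 and later) call a `__missing__` method on dict subclasses when
a key is absent, which allows something close to Perl's automatic
nesting. The following is only a sketch in Python 3 syntax; the class
name `NestedDict` and the sample rows are made up for illustration.

```python
class NestedDict(dict):
    """dict that creates a nested NestedDict for any missing key."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when key is absent (Python 2.5+).
        child = self[key] = type(self)()
        return child


# Hypothetical rows, already converted to numbers as parselist would do.
rows = [
    [1, 1, 1, "MO", 906, "genus species 1", 1],
    [1, 1, 1, "MO", 906, "genus species 2", 1],
    [1, 2, 1, "MO", 906, "genus species 1", 3],
]

data = NestedDict()
for r in rows:
    # Intermediate levels spring into existence on first access.
    data[r[0]][r[1]][r[2]][r[3]][r[4]][r[5]] = r[6]

print(data[1][2][1]["MO"][906]["genus species 1"])
```

One caveat of this design: merely reading a missing key also creates it,
so membership should be tested with `in` rather than by indexing.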

> and the abundance is the $tokens[6] value. this now gives me a
> multidimensional data structure

Reading that I'm not sure I've understood anything - shouldn't you
want to use a multidimensional array for that? Anyone familiar with
Python's scientific/number crunching/array libraries should be able to
clear that up...
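
Short of a real array library, one flat alternative to six levels of
nesting is a single dictionary keyed by tuples, from which a matrix can
be pivoted. This is only a sketch in Python 3 syntax: the layout (one
row per site, one column per species, 0 for missing combinations) is an
assumption, since the desired output isn't shown here, and all names and
sample rows are made up.

```python
# Hypothetical rows: five location fields, a species name, an abundance.
rows = [
    (1, 1, 1, "MO", 906, "genus species 2", 4),
    (1, 1, 1, "MO", 906, "genus species 1", 1),
    (1, 2, 1, "MO", 906, "genus species 1", 3),
]

# One flat dictionary keyed by (site tuple, species) instead of six levels.
abundance = {}
for row in rows:
    site, species, count = row[:5], row[5], row[6]
    abundance[(site, species)] = count

# Sorted row and column labels for the matrix.
sites = sorted({site for site, _ in abundance})
species_names = sorted({name for _, name in abundance})

# One matrix row per site, one column per species.
# Filling gaps with 0 is an assumption about the wanted output.
matrix = [
    [abundance.get((site, name), 0) for name in species_names]
    for site in sites
]

for site, values in zip(sites, matrix):
    print(site, values)
```

Tuples compare element by element, so sorting the keys orders the sites
by the first field, then the second, and so on.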

> that i can use to loop over the keys and sort them by each as i go,
> then to write out the data into a matrix as above.

I'm not sure how you arrive at your matrix output, but looping over
the dictionary shouldn't be a problem now. However, since you also
want to sort the data (by key), and dictionaries notoriously don't
support that, I've written another function:

###

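def nestsort(dct):
  """convert a nested dictionary to a nested (key, value) list,
  recursively sorting it by key"""
  lst = []
  try:
    # sorted() copes with both list and view results of items()
    # (needs Python 2.4 or later)
    items = sorted(dct.items())
    for key, value in items:
      lst.append([key, nestsort(value)])
    return lst
  except AttributeError:
    # not a dictionary: a leaf value, returned unchanged
    return dct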

sorteddata = nestsort(datadict)

###

So now the data from the beginning looks like this (every level is a
list of [key, value] pairs, hence the doubled brackets):

[[1, [[1, [[1, [["MO", [[906, [["genus species 1", 1],
                               ["genus species 2", 1],
                               ["genus species 3", 1],
                               ["genus species 4", 1]]]]]]]]]]]]

which you probably could have had cheaper...

Now you can do something like:

###

# print(...) with a single formatted string runs on Python 2 and 3 alike
for region, rdata in sorteddata:
  print("Region %s" % region)
  for location, ldata in rdata:
    print(" " * 2 + "Location %s" % location)
    for site, sitedata in ldata:
      print(" " * 4 + "Site %s" % site)
      for stand, stdata in sitedata:
        print(" " * 6 + "Stand %s" % stand)
        for substrate, subdata in stdata:
          print(" " * 8 + "Substrate %s" % substrate)
          for genus, abundance in subdata:
            print(" " * 10 + "Genus %s Abundance %s" % (genus, abundance))

###

to test my script on your (real) data.
There's next to no error-checking, and it would surely be more
pythonic/beautiful/reusable if I had subclassed dict, but it works --
for my data at least.
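
The six nested loops above hard-code the depth of the data. A recursive
generator can flatten a nested dictionary into sorted rows whatever its
depth, without building the intermediate list first. This is a sketch in
Python 3 syntax; the function name `walk` and the sample data are made
up, and it assumes the keys at any one level are mutually comparable.

```python
def walk(dct, prefix=()):
    """Yield one flat tuple per leaf, visiting keys in sorted order."""
    for key in sorted(dct):
        value = dct[key]
        if isinstance(value, dict):
            # Descend, remembering the keys seen so far.
            for row in walk(value, prefix + (key,)):
                yield row
        else:
            yield prefix + (key, value)


# Hypothetical nested data in the same shape as datadict.
nested = {1: {2: {1: {"MO": {906: {"genus species 1": 3}}}},
              1: {1: {"MO": {906: {"genus species 2": 1,
                                   "genus species 1": 1}}}}}}

for row in walk(nested):
    print(row)
```

Each yielded tuple holds the five location keys, the species and the
abundance, so it can be written straight back out as one line of a
normalised file.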

> ok. so how do i do this in python? i've tried the "perlish" way but

Once more, it seems that "the perlish way" <> "the python way".

> didn't get very far, however i know it must be able to be done!

I don't think there's much of anything either language can do that the
other can't, but of course some things are harder than others...

> if you want to respond to this, try benmoretti at yahoo dot com dot au
> as i get too much spam otherwise

<posted to the NG and forwarded to you>

--
Christopher