[Tutor] Syntax for Simplest Way to Execute One Python Program Over 1000's of Datasets

James Reynolds eire1130 at gmail.com
Thu Jun 9 23:05:48 CEST 2011


My advice would be to stay away from generic names, like:

for item in items:
   do stuff with item

For a couple of lines it's OK, but when programs get large, your program will
get confusing even to you as the author.

Sometimes it's best just to do "for all in listx:", but I think that's rare.

Usually, you know what the contents of a list are, and you know what the
list is, so for example

for asset in portfolio:
   do stuff with asset

for loan in cdo:
   do stuff with loan

for protein in gen_sequence:
   do stuff with protein

etc.
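
As a small illustration of the naming point, here is a runnable sketch. The
portfolio data, its field names and the numbers are made up for the example;
only the naming pattern matters.

```python
# Hypothetical data: a portfolio is a list of assets, each a small dict.
portfolio = [
    {"ticker": "AAA", "shares": 10, "price": 2.5},
    {"ticker": "BBB", "shares": 4, "price": 10.0},
]

# The loop variable names one element of the collection, so the body
# reads naturally without looking up what "item" is.
total_value = 0.0
for asset in portfolio:
    total_value += asset["shares"] * asset["price"]

print(total_value)
```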

The other piece of advice I would give you is to make it more object-oriented.
That should get you away from having to rename things, as you say. Also use
self-referential path-finding mechanisms, such as os.getcwd() (get current
working directory), together with os.listdir() as mentioned previously.
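
Putting the listdir suggestion together with the review quoted below, a
minimal sketch of the batch run might look like this. The names
process_directory and write_counts, the '.txt' filter and the '_output'
naming rule are assumptions of mine (modelled on draft1.txt and
draft1_output.txt in the original post), not part of the original program.

```python
import os
import string
from collections import defaultdict


def get_word(item):
    # Strip leading/trailing digits and punctuation, then lowercase.
    return item.strip(string.digits + string.punctuation).lower()


def count_words(text):
    counts = defaultdict(int)  # missing keys start at 0
    for item in text.replace('--', ' ').split():
        word = get_word(item)
        if word:
            counts[word] += 1
    return counts


def write_counts(counts, countfile):
    # Same report layout as the original program.
    with open(countfile, 'w') as outfile:
        outfile.write("%-18s%s\n" % ("Word", "Count"))
        outfile.write("=" * 23 + "\n")
        for word, count in sorted(counts.items()):
            outfile.write("%-18s%d\n" % (word, count))


def process_directory(directory, suffix='.txt', marker='_output'):
    """Count words in every matching file; return the output paths written."""
    written = []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        # Skip other file types and output files from an earlier run.
        if ext != suffix or base.endswith(marker):
            continue
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path) as infile:
            text = infile.read()
        # Derive the output name from the input name (assumed convention).
        countfile = os.path.join(directory, base + marker + ext)
        write_counts(count_words(text), countfile)
        written.append(countfile)
    return written
```

From the directory holding the datasets, process_directory(os.getcwd()) would
then handle every .txt file in one go, with no filename edited by hand.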



On Thu, Jun 9, 2011 at 4:30 PM, Corey Richardson <kb1pkl at aim.com> wrote:

> On 06/09/2011 03:49 PM, B G wrote:
> > I'm trying to analyze thousands of different cancer datasets and run the
> > same python program on them.  I use Windows XP, Python 2.7 and the IDLE
> > interpreter.  I already have the input files in a directory and I want to
> > learn the syntax for the quickest way to execute the program over all these
> > datasets.
> >
> > As an example, for the sample python program below, I don't want to have to
> > go into the python program each time and change filename and countfile.  A
> > computer could do this much quicker than I ever could.  Thanks in advance!
> >
>
> I think os.listdir() would be better for you than os.walk(), as Walter
> suggested, but if you have a directory tree, walk is better. Your file
> code could be simplified a lot by using context managers, which for a
> file looks like this:
>
> with open(filename, mode) as f:
>    f.write("Stuff!")
>
> f will automatically be closed when the with block ends, even if an
> exception is raised inside it.
>
> Now for some code review!
>
> >
> > import string
> >
> > filename = 'draft1.txt'
> > countfile = 'draft1_output.txt'
> >
> > def add_word(counts, word):
> >     if counts.has_key(word):
> >         counts[word] += 1
> >     else:
> >         counts[word] = 1
> >
>
> See the notes on this later.
>
> > def get_word(item):
> >     word = ''
> >     item = item.strip(string.digits)
> >     item = item.lstrip(string.punctuation)
> >     item = item.rstrip(string.punctuation)
> >     word = item.lower()
> >     return word
>
> This whole function could be simplified to:
>
> return item.strip(string.digits + string.punctuation).lower()
>
> Note that foo.strip(bar) == foo.lstrip(bar).rstrip(bar)
>
> >
> >
> > def count_words(text):
> >     text = ' '.join(text.split('--')) #replace '--' with a space
>
> How about
>
> text = text.replace('--', ' ')
>
> >     items = text.split() #leaves in leading and trailing punctuation,
> >                          #'--' not recognised by split() as a word separator
>
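> Note that items = text.split('--') on its own would not do the same job:
> it splits only on '--' and no longer on whitespace, so keep the replace()
> above and then call split() with no argument.
>
> You can specify the split string, though! You should read the docs on
> string methods:
> http://docs.python.org/library/stdtypes.html#string-methods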
>
> >     counts = {}
> >     for item in items:
> >         word = get_word(item)
> >         if not word == '':
>
> That should be 'if word:', which just checks whether word evaluates to
> True. Since the only string that evaluates to False is '', it makes the
> code shorter and more readable.
>
> >             add_word(counts, word)
> >     return counts
>
> A better way would be using a defaultdict, like so:
>
> from collections import defaultdict
> [...]
>
> def count_words(text):
>     counts = defaultdict(int) # Every key starts off at 0!
>     items = text.replace('--', ' ').split()
>     for item in items:
>         word = get_word(item)
>         if word:
>             counts[word] += 1
>     return counts
>
>
> Apart from supplying a default value for missing keys, a defaultdict
> behaves like any other dict. We pass 'int' as a parameter because
> defaultdict calls that parameter with no arguments to produce the
> default value. It works out because int() == 0.
>
> >
> > infile = open(filename, 'r')
> > text = infile.read()
> > infile.close()
>
> This could be:
>
> with open(filename) as infile:
>     text = infile.read()
>
> When you're opening a file as 'r', the mode is optional!
>
> >
> > counts = count_words(text)
> >
> > outfile = open(countfile, 'w')
> > outfile.write("%-18s%s\n" %("Word", "Count"))
> > outfile.write("=======================\n")
>
> It may just be me, but I think
>
> outfile.write(('=' * 23) + '\n')
>
> looks better.
>
> >
> > counts_list = counts.items()
> > counts_list.sort()
> > for word in counts_list:
> >     outfile.write("%-18s%d\n" %(word[0], word[1]))
> >
> > outfile.close
>
> Parentheses are important! outfile.close is a method object,
> outfile.close() is a method call, so as written the file is never
> explicitly closed. Context managers make this easy, because you don't
> have to manually close things.
>
> Hope it helped,
> --
> Corey Richardson