Learning Python via a little word frequency program

Wed Jan 9 07:19:55 EST 2008

Andrew Savige a écrit :
> I'm learning Python by reading David Beazley's "Python Essential Reference"
> book and writing a few toy programs. To get a feel for hashes and sorting,
> I set myself this little problem today (not homework, BTW):
> 
>   Given a string containing a space-separated list of names:
> 
>     names = "freddy fred bill jock kevin andrew kevin kevin jock"
> 
>   produce a frequency table of names, sorted descending by frequency.
>   then ascending by name. For the above data, the output should be:
> 
>     kevin     : 3
>     jock      : 2
>     andrew    : 1
>     bill      : 1
>     fred      : 1
>     freddy    : 1
> 
> Here's my first attempt:
> 
> names = "freddy fred bill jock kevin andrew kevin kevin jock"
> freq = {}
> for name in names.split():
>     freq[name] = 1 + freq.get(name, 0)
> deco = zip([-x for x in freq.values()], freq.keys())
> deco.sort()
> for v, k in deco:
>     print "%-10s: %d" % (k, -v)
> 
> I'm interested to learn how more experienced Python folks would solve
> this little problem. 

For a one-shot Q&D script:

names = "freddy fred bill jock kevin andrew kevin kevin jock"
freqs = [(- names.count(name), name) for name in set(names.split())]
print "\n".join("%-10s : %s" % (n, -f) for f, n in sorted(freqs))

Now I might choose a very different solution for a more serious 
application, depending on detailed specs and intended use of the 
"frequency table".

> Though I've read about the DSU Python sorting idiom,
> I'm not sure I've strictly applied it above ... 

Perhaps not "strictly" since you don't really "undecorate", but that's 
another application of the same principle : provided the appropriate 
data structure, sort() (or sorted()) will do the right thing.

> and the -x hack above to
> achieve a descending sort feels a bit odd to me, though I couldn't think
> of a better way to do it.

The "other" way would be to pass a custom comparison callback to sort, 
which would be both slower and more complicated. Your solution is IMHO 
the right thing to do here.

> I also have a few specific questions. Instead of:
> 
> for name in names.split():
>     freq[name] = 1 + freq.get(name, 0)
> 
> I might try:
> 
> for name in names.split():
>     try:
>         freq[name] += 1
>     except KeyError:
>         freq[name] = 1

or a couple other solutions, including a defaultdict (python >= 2.5).

> Which is preferred?

It's a FAQ - or it should be one. Globally: the second one tends to be 
faster when there's no exception (ie the key already exists), but slower 
when exceptions happen. So it mostly depends on what you expect your 
dataset to be.

Now note that you don't necessarily need a dict here !-)

> Ditto for:
> 
> deco = zip([-x for x in freq.values()], freq.keys())
> 
> versus:
> 
> deco = zip(map(operator.neg, freq.values()), freq.keys())

As far as I'm concerned, I'd favor the first solution here. Reads better 
IMHO

> Finally, I might replace:
> 
> for v, k in deco:
>     print "%-10s: %d" % (k, -v)
> 
> with:
> 
> print "\n".join("%-10s: %d" % (k, -v) for v, k in deco)

That's what I'd do here too - but it depends on context (ie: for huge 
datasets and/or complex formating, i'd use a for loop).