[Tutor] help for dictionary sorting program

Jeff Shannon jeff@ccvcorp.com
Fri Apr 18 13:22:15 2003


Abdirizak abdi wrote:

> Hi,
> I was working on a function that takes a adictionary as an argument 
> and returns the frequency ordering of elements in the dictionary. I 
> have dictionary set up as follows:
>  
> Value         key        frequency   positions where the word was found
> natural    doc1.txt   (89,          (17, 22, 26))


This doesn't really show how your dictionary is defined.  You say that 
'doc1.txt' is the key, but I'd presume that you'll have many words from 
doc1.txt, and you'll want a dictionary listing all of them.  That 
indicates that you want to key the dictionary off of the word ('natural' 
in this case) instead of the filename.  I'd probably implement the 
dictionary something like this:


freq_dict = { 'natural' :  ('doc1.txt', 89, [17,22,26]), ...]

Each item in the dictionary is keyed off of the word, and the value is a 
tuple containing the filename, the frequency number within that file, 
and a list of the word positions.  If you want to be able to track word 
frequency in multiple files, you could fairly easily extend this to be a 
list of such tuples, one for each file, or perhaps even a nested dict 
keyed off of the filename.

> my intention is to pick-up the items in the dictionary by frequency [...]


So you want to get a list of all of the key:value pairs, sorted by 
value[1] ...

> def SortDict(adict):
>         """ this function sorts the frequency/key pairs in the
>             dictionary by highest to lowest frequency   """
>         counts = {}
>         for word in adict.items():
>                 if counts.has_key(word):
>                         counts[word] += 1
>                 else:


I'm guessing that you didn't include the actual else: clause in order to 
save space, since having nothing following the else: will result in a 
syntax error.  The problem here is that you're asking for adict.items(), 
which returns a list of [key, value] pairs, so that word is actually a 
2-element list.  It looks like you probably want to use adict.keys() 
instead, which will give you a list of only the keys in adict.  You 
could also write it as

    for word, value in adict.items():

and take advantage of automatic tuple unpacking.  Now, given the example 
dictionary I showed above, word will be 'natural' and value will be 
"('doc1.txt', 89, [17,22,26])".

However, I'm not sure what you're trying to accomplish here.  It looks 
like you're trying to get a count of the number of times each word 
occurs in adict.  But because dictionary keys must be unique, you're 
guaranteed that each word in adict occurs only once.

If you're trying to sort adict by the frequency of words, remember that 
you've already got that frequency as part of the value.  You just need 
to reorganize things so that you can use the built-in list sort(). 
 Remember what I said about needing to sort on value[1]?  We just need 
to create a list where value[1] is the first part of each element.  If 
we're iterating through adict.items(), then value is the second part of 
each item (item[1]).  So we build a list that contains (freq, item), 
sort that list, then discard freq and keep item.  (This procedure is 
frequently called decorate-sort-undecorate.)

def sort_by_freq(adict):
    templist = []
    for item in adict.items():
        element = (item[1][1], item)
        templist.append( element )
    templist.sort()
    result = []
    for element in templist:
        result.append( element[1] )
    return result

We can do this a bit more compactly by using list comprehensions:

def sort_by_freq(adict):
    templist = [ (item[1][1], item) for item in adict.items() ]
    templist.sort()
    result = [element[1] for element in templist]
    return result

The result that's returned by this will now be a list of (key, value) 
pairs, where value is the tuple I described above, with frequency as its 
middle element.  You can then print the first five elements:

sorted_list = sort_by_freq(freq_dict)
for line in sorted_list[:5]:
    print "word: %10s  freq: %4d" % (line[0], line[1][1])

Hope this helps...

Jeff Shannon
Technician/Programmer
Credit International