Smart text parsing

Thu Feb 5 23:17:39 EST 2004

Mathias Mamsch wrote:

> I have a text of about 1 million words. I want to count the words and put
> them, sorted by frequency, into a list like
> " list = [(most-common-word,1001),(2nd-word,986), ...] "
> 
> I think about 10% of them (about 100,000) are distinct words.
> 
> I am wondering if you can give me something faster than my approach:
> My first straightforward approach was:
> ----
> s = "Hello this is my 1 million word text"
> 
> s2 = s.split()
> dict = {}
> for i in s2:         # the loop needs 10s
>         if dict.has_key(i):
>                 dict[i] += 1
>         else:
>                 dict[i] = 1
> list = dict.items()
> #   this is slow:
> list.sort(lambda x,y: 2*(x[1] < y[1])-1)
> ----

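Passing a comparison function to sort slows things down a lot, because the
function is called from Python for every comparison the sort makes.  Try
building (frequency, word) tuples and sorting those with the default
comparison instead: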

d = {}  # maps word -> number of occurrences
parts = "Hello this is my 1 million word text".split()
for part in parts:
    if part in d:
        d[part] += 1
    else:
        d[part] = 1

lst = d.items()
lst = [(t[1], t[0]) for t in lst]  # (frequency, string)
lst.sort()  # sort as usual
lst.reverse()  # reverse, so highest numbers are first
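
If it helps, here is the same idea wrapped up as a small function.  This is
only a sketch: the name word_frequencies and the sample sentence are made up
for illustration, and it uses dict.get in place of the explicit membership
test.  It also swaps the tuples back at the end so the result has the
(word, count) shape from the question.

```python
def word_frequencies(text):
    """Return (word, count) pairs, most frequent first."""
    counts = {}
    for word in text.split():
        # dict.get supplies 0 for unseen words, so no separate test is needed
        counts[word] = counts.get(word, 0) + 1

    # Put the count first so the default tuple comparison sorts by it.
    pairs = [(count, word) for word, count in counts.items()]
    pairs.sort()
    pairs.reverse()  # highest counts first

    # Swap back to the (word, count) shape asked for in the question.
    return [(word, count) for count, word in pairs]


result = word_frequencies("the cat and the dog and the bird")
print(result)
# [('the', 3), ('and', 2), ('dog', 1), ('cat', 1), ('bird', 1)]
```

Note that words with equal counts come out in reverse alphabetical order,
since the whole (count, word) tuple is compared and the list is then reversed.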

HTH,

-- 
Hans (hans at zephyrfalcon.org)
http://zephyrfalcon.org/