Word frequencies -- Python or Perl for performance?
Mats Kindahl
matkin at iar.se
Thu Mar 21 09:55:00 EST 2002
jimd at vega.starshine.org (Jim Dennis) writes:
> In article <mailman.1016223990.19235.python-list at python.org>, Nick Arnett wrote:
>
> > Anybody have any experience generating word frequencies from short documents
> > with Python and Perl? Given a choice between the two, I'm wondering what
> > will be faster. And a related question... any idea if there will be a
> > significant performance hit (or advantage?) from storing the data in MySQL
> > v. my own file-based data structures?
>
> > I'll be processing a fairly large number of short (1-6K or so) documents at
> > a time, so I'll be able to batch up things quite a bit. I'm thinking that
> > the database might help me avoid loading up a lot of useless data. Since
> > word frequencies follow a Zipf distribution, I'm guessing that I can spot
> > unusual words (my goal here) by loading up the top 80 percent or so of words
> > in the database (by occurrences) and focusing on the words that are in the
> > docs but not in the set retrieved from the database.
>
> > Thanks for any thoughts on this and pointers to helpful examples or modules.
>
> > Nick Arnett
>
> I don't know what you're really trying to do, but I decided to
> code up a quickie "word counter" for the hell of it.
>
> I started with one that would simply count "words" (white space
> separated sequences of letters, hyphens and apostrophes). I then
> decided to also denote which of them were "known" words and keep
> a count of those as well.
>
> So, here's my very own version (in about 80 lines):
>
>
> #!/usr/bin/env python2.2
[program removed]
> I tested by running the following commands:
>
> for i in /bin/* /usr/bin/*; do
>     bname=$(basename $i); man $bname | col -b > /tmp/$bname.man
> done
>
> time ./wordcount.py /tmp/*.man | head
>
> Here's the output from that:
>
> 1602048 1361723 36978 0.849988889222 0.0271553025101
> 117960 the*
> 41673 to*
> 36275 is*
> 34975 a
> 32191 of*
> 27045 and*
> 22881 in*
> 20336 for*
> 17571 be*
> Traceback (most recent call last):
> File "./wordcount.py", line 81, in ?
> print "%7d %s" % (count, word)
> IOError: [Errno 32] Broken pipe
>
> real 1m48.212s
> user 1m47.950s
> sys 0m0.250s
> $ ls /tmp/*.man | ./wc.py
> 1761 1761 31804
> $ du /tmp/*.man
> ....
> 15836 total
> $ find /tmp/*.man -printf "%s\n" \
> | awk '{n++; t+=$1}; END { print t/n ; }'
> 7104.33
>
> ... so it handled over 1700 medium size files (average 6K each,
> about 14Mb total) in less than two minutes. Of the words I
> counted it looks like about 84% of them were "known" words from
> /usr/share/dict/words; and it looks like I found about 2% of the
> known words. (In other words, the Linux man pages only use about
> 2% of the English vocabulary). I doubt the top ten words from my
> list will surprise anyone: the, to, is, a, of, and, in ...
>
> I don't have the urge to write a version in Perl. Not tonight
> anyway.
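(Jim's program was trimmed from the quote above, so for archive
readers, here is a rough sketch of what his description amounts to:
split out words, count them in a dictionary, and star the ones found
in the system word list. This is my reconstruction, not Jim's code;
the word pattern, dictionary path, and output format are all guesses
on my part.)

#!/usr/bin/env python2.2
# A sketch only -- not Jim's original script.
import re, sys

word_re = re.compile(r"[a-z][a-z'-]*")

def load_known(path):
    # Build a dictionary of known words for cheap membership tests.
    known = {}
    for line in open(path):
        known[line.strip().lower()] = 1
    return known

def main(files):
    known = load_known("/usr/share/dict/words")
    counts = {}
    total = 0
    for name in files:
        text = open(name).read().lower()
        for word in word_re.findall(text):
            counts[word] = counts.get(word, 0) + 1
            total = total + 1
    print "total words:", total
    # Sort by falling count; star the words found in the word list.
    items = [(-count, word) for word, count in counts.items()]
    items.sort()
    for negcount, word in items:
        star = known.has_key(word) and "*" or ""
        print "%7d %s%s" % (-negcount, word, star)

if __name__ == "__main__":
    main(sys.argv[1:])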
Just to have something to compare with, here is a Perl version that I
believe does something along the same lines.
#!/usr/bin/perl

my %dict;               # count of each word seen in the input
my %known;              # set of words from the system word list
my $known_words = 0;
my $total = 0;

open(DICT, "/usr/share/lib/dict/words") or die "open: $!\n";
{
    my @words = <DICT>;        # Read all the words into a list
    chomp @words;              # Drop the trailing newlines
    $known_words = @words;     # Count known words
    # Enter the list of words into a hash via a hash slice
    @known{@words} = (1) x $known_words;
}
close(DICT);

undef $/;                      # Slurp mode: read each file in one go
while (<>) {
    foreach (split(/\s+/)) {
        # Strip a leading or trailing apostrophe, or a common
        # contraction suffix ('ll, n't, 's)
        s/\'$|^\'|\'ll$|n\'t|\'s//;
        # Skip tokens that start or end with a hyphen, contain
        # non-word characters, or are empty
        next if /^-|-$|\W|^\s*$/;
        ++$dict{lc $_};
        ++$total;
    }
}

print "Known: $known_words words, total: $total words\n";
# Print words by falling frequency; star the known words
foreach (sort { $dict{$b} <=> $dict{$a} } (keys %dict)) {
    print "$dict{$_} $_", ($known{$_} ? "*" : ""), "\n"
        if $dict{$_} > 1;
}
I tried it on only about 600 manual pages, with the following result:
bash-2.01$ time perl wordcount.pl tmp/* | head
Known: 25143 words, total: 464709 words
44362 the*
13174 is*
12593 of*
11907 a*
11859 to*
8735 and*
7438 in*
6255 for*
6044 if*
Broken Pipe
real 0m42.669s
user 0m22.190s
sys 0m0.360s
> Of course this script is free for any use you can think of.
Same here.
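And on Nick's original point about the Zipf distribution: given a
table of counts like the ones either script builds, filtering out the
words that make up the bulk of the occurrences is only a few more
lines. A sketch, again in Python 2 style; the 80 percent cutoff is
Nick's figure and the function names are just mine:

def common_words(counts, fraction=0.80):
    # Return the words that together cover `fraction` of all
    # occurrences; the Zipf-like skew keeps this set small.
    total = 0
    for count in counts.values():
        total = total + count
    items = [(-count, word) for word, count in counts.items()]
    items.sort()               # most frequent first
    common = {}
    covered = 0
    for negcount, word in items:
        if covered >= fraction * total:
            break
        common[word] = 1
        covered = covered - negcount
    return common

def unusual_words(doc_counts, common):
    # Words that occur in a document but not in the common set --
    # the "unusual" words Nick is after.
    return [word for word in doc_counts.keys()
            if not common.has_key(word)]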
Good luck,
--
Mats Kindahl, IAR Systems, Sweden
Any opinions expressed are my own, not my company's.