Word frequencies -- Python or Perl for performance?
Mats Kindahl
matkin at iar.se
Thu Mar 21 09:55:00 EST 2002
jimd at vega.starshine.org (Jim Dennis) writes:
> In article <mailman.1016223990.19235.python-list at python.org>, Nick Arnett wrote:
>
> > Anybody have any experience generating word frequencies from short documents
> > with Python and Perl? Given a choice between the two, I'm wondering what
> > will be faster. And a related question... any idea if there will be a
> > significant performance hit (or advantage?) from storing the data in MySQL
> > v. my own file-based data structures?
>
> > I'll be processing a fairly large number of short (1-6K or so) documents at
> > a time, so I'll be able to batch up things quite a bit. I'm thinking that
> > the database might help me avoid loading up a lot of useless data. Since
> > word frequencies follow a Zipf distribution, I'm guessing that I can spot
> > unusual words (my goal here) by loading up the top 80 percent or so of words
> > in the database (by occurrences) and focusing on the words that are in the
> > docs but not in the set retrieved from the database.
>
> > Thanks for any thoughts on this and pointers to helpful examples or modules.
>
> > Nick Arnett
>
> I don't know what you're really trying to do, but I decided to
> code up a quickie "word counter" for the hell of it.
>
> I started with one that would simply count "words" (white space
> separated sequences of letters, hyphens and apostrophes). I then
> decided to also denote which of them were "known" words and keep
> a count of those as well.
>
> So, here's my very own version (in about 80 lines):
>
>
> #!/usr/bin/env python2.2
[program removed]
> I tested by running the following commands:
>
> for i in /bin/* /usr/bin/*; do
>     bname=$(basename $i); man $bname | col -b > /tmp/$bname.man
> done
>
> time ./wordcount.py /tmp/*.man | head
>
> Here's the output from that:
>
> 1602048 1361723 36978 0.849988889222 0.0271553025101
> 117960 the*
> 41673 to*
> 36275 is*
> 34975 a
> 32191 of*
> 27045 and*
> 22881 in*
> 20336 for*
> 17571 be*
> Traceback (most recent call last):
> File "./wordcount.py", line 81, in ?
> print "%7d %s" % (count, word)
> IOError: [Errno 32] Broken pipe
>
> real 1m48.212s
> user 1m47.950s
> sys 0m0.250s
> $ ls /tmp/*.man | ./wc.py
> 1761 1761 31804
> $ du /tmp/*.man
> ....
> 15836 total
> $ find /tmp/*.man -printf "%s\n" \
> | awk '{n++; t+=$1}; END { print t/n ; }'
> 7104.33
>
> ... so it handled over 1700 medium size files (average 6K each,
> about 14Mb total) in less than two minutes. Of the words I
> counted it looks like about 84% of them were "known" words from
> /usr/share/dict/words; and it looks like I found about 2% of the
> known words. (In other words, the Linux man pages only use about
> 2% of the English vocabulary). I doubt the top ten words from my
> list will surprise anyone: the, to, is, a, of, and, in ...
>
> I don't have the urge to write a version in Perl. Not tonight
> anyway.
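(Jim's program was trimmed from the quote above, so for archive
readers, here is a rough sketch of what his description amounts to:
split out words, count them in a dictionary, and star the ones found
in the system word list. This is my reconstruction, not Jim's code;
the word pattern, dictionary path, and output format are all guesses
on my part.)

#!/usr/bin/env python2.2
# A sketch only -- not Jim's original script.
import re, sys

word_re = re.compile(r"[a-z][a-z'-]*")

def load_known(path):
    # Build a dictionary of known words for cheap membership tests.
    known = {}
    for line in open(path):
        known[line.strip().lower()] = 1
    return known

def main(files):
    known = load_known("/usr/share/dict/words")
    counts = {}
    total = 0
    for name in files:
        text = open(name).read().lower()
        for word in word_re.findall(text):
            counts[word] = counts.get(word, 0) + 1
            total = total + 1
    print "total words:", total
    # Sort by falling count; star the words found in the word list.
    items = [(-count, word) for word, count in counts.items()]
    items.sort()
    for negcount, word in items:
        star = known.has_key(word) and "*" or ""
        print "%7d %s%s" % (-negcount, word, star)

if __name__ == "__main__":
    main(sys.argv[1:])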
Just to have something to compare with, here is a Perl version that I
believe does something along the same lines.
#!/usr/bin/perl

my %dict;               # count of each word seen in the input
my %known;              # set of words from the system word list
my $known_words = 0;
my $total = 0;

open(DICT, "/usr/share/lib/dict/words") or die "open: $!\n";
{
    my @words = <DICT>;        # Read all the words into a list
    chomp @words;              # Drop the trailing newlines
    $known_words = @words;     # Count known words
    # Enter the list of words into a hash via a hash slice
    @known{@words} = (1) x $known_words;
}
close(DICT);

undef $/;                      # Slurp mode: read each file in one go
while (<>) {
    foreach (split(/\s+/)) {
        # Strip a leading or trailing apostrophe, or a common
        # contraction suffix ('ll, n't, 's)
        s/\'$|^\'|\'ll$|n\'t|\'s//;
        # Skip tokens that start or end with a hyphen, contain
        # non-word characters, or are empty
        next if /^-|-$|\W|^\s*$/;
        ++$dict{lc $_};
        ++$total;
    }
}

print "Known: $known_words words, total: $total words\n";
# Print words by falling frequency; star the known words
foreach (sort { $dict{$b} <=> $dict{$a} } (keys %dict)) {
    print "$dict{$_} $_", ($known{$_} ? "*" : ""), "\n"
        if $dict{$_} > 1;
}
I tried it on only about 600 manual pages, with the following result:
bash-2.01$ time perl wordcount.pl tmp/* | head
Known: 25143 words, total: 464709 words
44362 the*
13174 is*
12593 of*
11907 a*
11859 to*
8735 and*
7438 in*
6255 for*
6044 if*
Broken Pipe
real 0m42.669s
user 0m22.190s
sys 0m0.360s
> Of course this script is free for any use you can think of.
Same here.
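And on Nick's original point about the Zipf distribution: given a
table of counts like the ones either script builds, filtering out the
words that make up the bulk of the occurrences is only a few more
lines. A sketch, again in Python 2 style; the 80 percent cutoff is
Nick's figure and the function names are just mine:

def common_words(counts, fraction=0.80):
    # Return the words that together cover `fraction` of all
    # occurrences; the Zipf-like skew keeps this set small.
    total = 0
    for count in counts.values():
        total = total + count
    items = [(-count, word) for word, count in counts.items()]
    items.sort()               # most frequent first
    common = {}
    covered = 0
    for negcount, word in items:
        if covered >= fraction * total:
            break
        common[word] = 1
        covered = covered - negcount
    return common

def unusual_words(doc_counts, common):
    # Words that occur in a document but not in the common set --
    # the "unusual" words Nick is after.
    return [word for word in doc_counts.keys()
            if not common.has_key(word)]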
Good luck,
--
Mats Kindahl, IAR Systems, Sweden
Any opinions expressed are my own, not my company's.