Pythonic Porter stemmers (Was: Re: Word frequencies -- Python or Perl for performance?)

John Machin sjmachin at lexicon.net
Sun Mar 17 16:13:39 EST 2002


"Van Gale" <cgale1 at cox.net> wrote in message news:<Bn%k8.4574$J54.497662 at news1.west.cox.net>...
> W.B. Frakes and R. Baeza-Yates. 1992. "Information Retrieval: Data
> Structures and Algorithms," Prentice-Hall, describes the Porter algorithm as
> well as a few other stemming algorithms.  The reference for the algorithm
> is:
> 
>   Porter, M. F. 1980. "An Algorithm for Suffix Stripping." Program, 14(3),
> 130-37.
> 

Martin Porter has a home page for his stemming algorithm.

http://www.tartarus.org/~martin/PorterStemmer/index.html

Read all the way through to the last line.

> Frakes mentions nothing about a patent on the Porter algorithm, and I'd be
> surprised if there were since it was pretty rare back in the "good old
> days".

Check out Porter's personal home-page. Given the comment about his
family buying him a comb after he first put his photo on the web, I
get the impression not of patent-royalty-rich but of
archetypal-English-academic :-)

> 
> I worked on a huge indexing project for a legal publisher, and we developed
> our own stemming algorithm.  It was much simpler than Porter, basically
> being the most obvious conflations (like remove "s" and "ies") which covered
> the vast majority of English words, and then a list of "exceptions".  Of
> course we had the advantage of 50+ editorial staff capable of proofreading
> the index finding new exceptions, but I still think that's a better way to
> go than trying to stem completely by algorithm.  As hard as the Porter
> algorithm tries it still makes a *lot* of mistakes.

And of course, once you had an exception dictionary, for a performance
boost you'd consider dumping into it the 10^n most frequent words and
their "correct" stems, whether or not the stemming algorithm gave the
"correct" result -- wouldn't you?
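
A minimal sketch of that idea, under stated assumptions: the suffix rules,
the length thresholds, the exception entries, and the names crude_stem,
stem and preload_frequent are all made up for illustration. This is neither
the Porter algorithm nor the publisher's actual code. The dictionary is
consulted first, so a preloaded frequent word costs one lookup and never
reaches the suffix rules.

```python
# Hypothetical exception dictionary, checked before any suffix rule runs.
EXCEPTIONS = {
    "series": "series",
    "news": "news",
    "this": "this",
}


def crude_stem(word):
    """Strip only the most obvious plural suffixes (illustrative rules)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def stem(word, exceptions=EXCEPTIONS):
    """Return the stem, preferring the dictionary over the suffix rules."""
    word = word.lower()
    # One lookup settles both true exceptions and preloaded frequent words.
    if word in exceptions:
        return exceptions[word]
    return crude_stem(word)


def preload_frequent(exceptions, frequent_stems):
    """Merge frequent words and their known stems into a new lookup table.

    Entries are added even when crude_stem would already get them right;
    the point is speed, not correctness. Curated exceptions win on conflict.
    """
    merged = dict(frequent_stems)
    merged.update(exceptions)
    return merged
```

Usage would look something like table = preload_frequent(EXCEPTIONS,
{"words": "word", "the": "the"}) followed by stem("words", table). Whether
the lookup actually beats running the rules depends on the rule set, so it
is worth timing before committing to it.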


