Pythonic Porter stemmers (Was: Re: Word frequencies -- Python or Perl for performance?)

Van Gale cgale1 at cox.net
Sun Mar 17 06:47:45 EST 2002


W.B. Frakes and R. Baeza-Yates. 1992. "Information Retrieval: Data
Structures and Algorithms," Prentice-Hall, describes the Porter algorithm as
well as a few other stemming algorithms.  The reference for the algorithm
is:

  Porter, M. F. 1980. "An Algorithm for Suffix Stripping." Program, 14(3),
130-37.

Frakes mentions nothing about a patent on the Porter algorithm, and I'd be
surprised if there were one, since patenting algorithms was pretty rare back
in the "good old days".

I worked on a huge indexing project for a legal publisher, and we developed
our own stemming algorithm.  It was much simpler than Porter: basically just
the most obvious conflations (like removing "s" and "ies"), which covered
the vast majority of English words, plus a list of "exceptions".  Of
course we had the advantage of 50+ editorial staff who could proofread
the index and find new exceptions, but I still think that's a better way to
go than trying to stem completely by algorithm.  As hard as the Porter
algorithm tries, it still makes a *lot* of mistakes.
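
To give a rough idea of the "obvious conflations plus exceptions" approach,
here is a minimal sketch.  It is not the code we actually used: the name
simple_stem, the suffix rules, the length cutoffs, and the entries in
EXCEPTIONS are all made up for illustration.

```python
# Hypothetical exception list: words the simple suffix rules would get
# wrong.  In practice proofreaders would grow this table over time.
EXCEPTIONS = {
    "news": "news",
    "series": "series",
    "dies": "die",
}


def simple_stem(word, exceptions=EXCEPTIONS):
    """Conflate obvious plurals, checking the exception list first."""
    w = word.lower()
    # Exceptions always win over the suffix rules.
    if w in exceptions:
        return exceptions[w]
    # "parties" -> "party"
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"
    # "courts" -> "court", but leave "class", "status", "basis" alone.
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]
    return w


if __name__ == "__main__":
    for word in ["parties", "courts", "class", "news", "dies"]:
        print(word, "->", simple_stem(word))
```

The point of the design is that the rules stay dumb and predictable, and
every mistake someone spots becomes one more entry in the exception table
rather than one more special case in the code.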

--
Van Gale





