get word base

Bengt Richter bokr at oz.net
Fri Jun 28 17:12:58 EDT 2002


On Fri, 28 Jun 2002 15:14:50 -0500, John Hunter <jdhunter at nitace.bsd.uchicago.edu> wrote:

>
>I would like to be able to get the root/base of a word by stripping
>off plurals, gerund endings, participle endings etc...  Here is a
>totally naive first attempt that gets it right sometimes:
>
>import re
>
>rgx = re.compile(r'(\w+?)(?:ing|ed|es|s)$')
>
>def get_base(word):
>
>    m = rgx.match(word)
>    if m:
>        return m.group(1)
>    else:
>        return word
>
>words = ['hello', 'taxes', 'thoughts', 'walked', 'rakes']
>
>for word in words:
>    print(word, get_base(word))
>
>Produces the following output
>> python get_baseword.py
>hello hello
>taxes tax
>thoughts thought
>walked walk
>rakes rak
>
>
>I can think of a few things to do to refine this, but before I forge
>ahead, I wanted to solicit advice.
>
Google for python stemmer ;-)
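
If you want to stay with plain regexes in the meantime, a rough sketch of
an ordered rule table might look like the following. The rules, the
MIN_STEM cutoff, and the names RULES and MIN_STEM are illustrative
assumptions, not a published stemming algorithm, and they will still get
plenty of words wrong.

```python
import re

# Hypothetical rule table: (suffix pattern, replacement), tried in order.
# Illustrative only; a real stemmer has many more rules and exceptions.
RULES = [
    (re.compile(r"ies$"), "y"),           # ponies -> pony
    (re.compile(r"(?<=[sxz])es$"), ""),   # taxes -> tax
    (re.compile(r"(?<=ch|sh)es$"), ""),   # wishes -> wish
    (re.compile(r"(?<![su])s$"), ""),     # rakes -> rake, but leave "glass", "bus"
    (re.compile(r"ing$"), ""),            # singing -> sing
    (re.compile(r"ed$"), ""),             # walked -> walk
]

# Assumed cutoff: refuse to strip if fewer than this many letters remain,
# so that "sing" is not reduced to "s".
MIN_STEM = 3


def get_base(word):
    """Return word with the first applicable suffix rule applied."""
    for pattern, replacement in RULES:
        stem, count = pattern.subn(replacement, word)
        if count and len(stem) >= MIN_STEM:
            return stem
    return word


if __name__ == "__main__":
    for word in ["hello", "taxes", "thoughts", "walked", "rakes"]:
        print(word, get_base(word))
```

Anchoring each pattern with $ keeps a suffix from matching in the middle
of a word, and trying the more specific endings first is what lets
"rakes" come out as "rake" while "taxes" still comes out as "tax".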

Regards,
Bengt Richter


