replace only full words

Sat Sep 28 13:37:49 EDT 2013

MRAB writes:

> On 28/09/2013 17:11, cerr wrote:
> > Hi,
> >
> > I have a list of sentences and a list of words. Every full word
> > that appears within a sentence shall be extended by <WORD>, i.e. "I
> > drink in the house." would become "I <drink> in the <house>." (and
> > not "I <d<rink> in the <house>."). I have attempted it like this:
> >
> >    for sentence in sentences:
> >      for noun in nouns:
> >        if " "+noun+" " in sentence or " "+noun+"?" in sentence or " "+noun+"!" in sentence or " "+noun+"." in sentence:
> >          sentence = sentence.replace(noun, '<' + noun + '>')
> >
> >      print(sentence)
> >
> > But what if the word is at the beginning of a sentence? I also
> > don't like the approach using defined word terminations. Also, is
> > there a way to make it faster?
> >
> It sounds like a regex problem to me:
> 
> import re
> 
> nouns = ["drink", "house"]
> 
> pattern = re.compile(r"\b(" + "|".join(nouns) + r")\b")
> 
> for sentence in sentences:
>      sentence = pattern.sub(r"<\g<0>>", sentence)
>      print(sentence)
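
The quoted pattern puts the nouns into the regex verbatim, so a noun
containing a regex metacharacter would change what the pattern means.
A minimal sketch of a guarded variant, assuming the nouns are plain
strings to be matched literally (mark_nouns is a hypothetical helper
name, not from the thread):

```python
import re

nouns = ["drink", "house"]

# re.escape makes each noun match literally, even if it contains
# characters such as "." or "+" (assumption: nouns are plain strings).
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, nouns)) + r")\b")

def mark_nouns(sentence):
    # \g<0> refers to the whole match, so the noun is kept as written.
    return pattern.sub(r"<\g<0>>", sentence)

print(mark_nouns("I drink in the house."))
```

Because of the \b anchors, "drinking" or "houses" should be left alone,
and a noun at the very start of a sentence is matched like any other.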

Maybe tokenize by a regex and then join the replacements of all
tokens:

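import re

# Assumption: isfullword is not defined in this post; a plain set lookup
# over the words to mark stands in for it here.
words = {'this', 'test'}

def isfullword(token):
   return token in words

def substitute(token):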
   if isfullword(token.lower()):
      return '<{}>'.format(token)
   else:
      return token

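def tokenize(sentence):
   # The capturing group makes re.split keep the separators as tokens,
   # so joining the tokens reproduces the original sentence.
   return re.split(r'(\W)', sentence)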

sentence = 'This is, like, a test.'

tokens = map(substitute, tokenize(sentence))
sentence = ''.join(tokens)

For better results, both tokenization and substitution need to depend
on context. Doing some of that should be an interesting exercise.
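
As one possible starting point for that exercise, here is a minimal
sketch in which tokenization looks at the neighbouring characters: an
apostrophe or hyphen between letters is treated as part of the word, so
"don't" and "well-known" stay single tokens. The names WORD and
mark_words and the set of target words are hypothetical, and the word
rule is an assumption, not something from the thread:

```python
import re

# A word is a run of letters, optionally continued by an apostrophe or
# hyphen followed by more letters (assumed rule; real text needs more care).
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def mark_words(sentence, targets):
    """Wrap every whole-word occurrence of a target in angle brackets."""
    def substitute(match):
        token = match.group(0)
        # Set membership keeps the per-token check cheap for large word lists.
        return '<{}>'.format(token) if token.lower() in targets else token
    return WORD.sub(substitute, sentence)

print(mark_words("Don't drink in the well-known house.",
                 {'drink', 'house', 'known'}))
```

Here "known" is in the target set but is not marked, because it only
occurs inside the single token "well-known". Whether that is the wanted
behaviour is exactly the kind of decision that depends on context.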