Replace stop words (remove words from a string)

Gary Herron gherron at islandtraining.com
Thu Jan 17 03:45:13 EST 2008


BerlinBrown wrote:
> if I have an array of "stop" words, and I want to replace those values
> with something else; in a string, how would I go about doing this.  I
> have this code that splits the string and then does a difference but I
> think there is an easier approach:
>
> E.g.
>
> mystr =
> kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;
>
> if I have an array stop_list = [ "[BAD]", "[BAD2]" ]
>
> I want to replace the values in that list with a zero length string.
>
> I had this before, but I don't want to use this approach; I don't want
> to use the split.
>
> line_list = line.lower().split()
> res = list(set(keywords_list).difference(set(ENTITY_IGNORE_LIST)))
>   
Strings have a replace method that produces a new string with all
occurrences of one substring replaced by another.  You'd have to loop
through your stop_list one word at a time, calling replace for each.

>>> s = 'abcxyzabc'
>>> s.replace('xyz','')
'abcabc'
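
Applied to the example in the question, the loop might look like the
sketch below.  The function name remove_stop_words is made up for
illustration; mystr and stop_list are taken from the question, with the
string quoted so that it is valid Python.

```python
def remove_stop_words(text, stop_list):
    """Return text with every entry of stop_list removed.

    Hypothetical helper, not part of the standard library.
    """
    for word in stop_list:
        # str.replace returns a new string; the original is left untouched.
        text = text.replace(word, '')
    return text


mystr = ('kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]'
         'LSKJFKSFJKSJF;L[BAD2]kjsldfsd;')
stop_list = ['[BAD]', '[BAD2]']

cleaned = remove_stop_words(mystr, stop_list)
print(cleaned)
```

One caveat: the entries are removed in list order, so if one stop word
contains another, the order of stop_list can change the result.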


If either the string or the stop_list grows particularly large, this
approach won't scale very well, since the whole string is re-created
for each stop_list entry.  In that case, I'd look into the regular
expression (re) module.  You may be able to finagle a way to find and
replace all stop_list entries in one pass.  (Finding them all is easy;
I'm less sure whether you can replace them all at once.)
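
For what it's worth, a single pass does look feasible with re.sub, if
the stop words are joined into one alternation.  The sketch below
assumes the stop words are literal text (hence re.escape, since "[" and
"]" are special in a pattern) and that longer entries should win over
shorter ones that share a prefix.  The helper name build_stop_pattern is
made up for illustration.

```python
import re


def build_stop_pattern(stop_list):
    """Compile one pattern that matches any entry of stop_list.

    Hypothetical helper. Entries are treated as literal text, not as
    regular expressions.
    """
    # Longest first, so a longer stop word is tried before a shorter
    # one that is a prefix of it.
    ordered = sorted(stop_list, key=len, reverse=True)
    return re.compile('|'.join(re.escape(word) for word in ordered))


mystr = ('kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]'
         'LSKJFKSFJKSJF;L[BAD2]kjsldfsd;')
stop_list = ['[BAD]', '[BAD2]']

pattern = build_stop_pattern(stop_list)
# sub scans the string once and replaces every match with ''.
cleaned = pattern.sub('', mystr)
print(cleaned)
```

If the same stop_list is used on many strings, the compiled pattern can
be built once and reused.  Whether this actually beats the replace loop
depends on the data, so it is worth timing both on realistic input.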


Gary Herron




