how to remove the same words in the paragraph

Tim Chase python.list at tim.thechases.com
Tue Nov 3 17:57:14 EST 2009


kylin wrote:
> I need to remove the word if it appears in the paragraph twice. could
> some give me some clue or some useful function in the python.

Sounds like homework.  To fail your class, use this one:

 >>> p = "one two three four five six seven three four eight"
 >>> s = set()
 >>> print ' '.join(w for w in p.split() if not (w in s or s.add(w)))
one two three four five six seven eight

which is absolutely horrible because it mutates the set within 
the generator expression (it only works because s.add() returns 
None, so the "or" both tests and records the word).  The passable 
solution would use a for-loop to iterate over each word in the 
paragraph, emitting it if it hadn't already been seen.  Maintain 
those words in a set, so your words know how not to be seen. 
("Mr. Nesbitt, would you please stand up?")
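
A minimal sketch of that for-loop approach, assuming the paragraph 
is just whitespace-separated words.  The function name 
remove_repeated_words is only illustrative, not something from 
the original question:

```python
def remove_repeated_words(paragraph):
    """Keep the first occurrence of each word, drop later repeats."""
    seen = set()   # words already emitted
    kept = []      # words in their original order
    for word in paragraph.split():
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return ' '.join(kept)

p = "one two three four five six seven three four eight"
result = remove_repeated_words(p)
# result == "one two three four five six seven eight"
```

The set gives a fast membership test, while the separate list 
keeps the words in the order they first appeared.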

This also assumes your paragraph consists only of words and 
whitespace.  But since you posted your previous homework-sounding 
question on stripping out non-word/whitespace characters, you'll 
want to look into using a regexp like "[^\w\s]" to clean up the 
cruft in the paragraph.  Neither solution above preserves 
non-whitespace/non-word characters, for which I'd recommend using 
a re.sub() with a callback.  Such a callback class might look 
something like

 >>> class Dedupe:
...     def __init__(self):
...             self.s = set()
...     def __call__(self, m):
...             w = m.group(0)
...             if w in self.s: return ''
...             self.s.add(w)
...             return w
...
 >>> r.sub(Dedupe(), p)

where I leave the definition of "r" to the student.  Also beware 
of case-differences for which you might have to normalize.
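
One possible way to handle the case issue, as a sketch only: 
compare on a lowercased key but emit the word as it was first 
written.  The function name is hypothetical, and this again 
assumes plain whitespace-separated words:

```python
def remove_repeated_words_ignoring_case(paragraph):
    """Drop repeats of a word, treating 'Cat' and 'cat' as the same."""
    seen = set()   # lowercased forms of words already emitted
    kept = []
    for word in paragraph.split():
        key = word.lower()   # normalized form, used only for comparison
        if key not in seen:
            seen.add(key)
            kept.append(word)   # keep the original spelling
    return ' '.join(kept)

result = remove_repeated_words_ignoring_case("The cat saw the Cat")
# result == "The cat saw"
```

The same idea carries over to the callback: look up and store 
the lowercased match in the set, and return the match unchanged.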

You'll also want to use more descriptive variable names than my 
one-letter tokens.

-tkc






