how to remove the same words in the paragraph

S.Selvam s.selvamsiva at gmail.com
Mon Nov 9 02:21:33 EST 2009


On Wed, Nov 4, 2009 at 4:27 AM, Tim Chase <python.list at tim.thechases.com> wrote:

> kylin wrote:
>
>> I need to remove the word if it appears in the paragraph twice. Could
>> someone give me a clue or some useful function in Python?
>>
>
> Sounds like homework.  To fail your class, use this one:
>
> >>> p = "one two three four five six seven three four eight"
> >>> s = set()
> >>> print(' '.join(w for w in p.split() if not (w in s or s.add(w))))
> one two three four five six seven eight
>
> which is absolutely horrible because it mutates the set within the generator
> expression.  The passable solution would use a for-loop to iterate over
> each word in the paragraph, emitting it if it hadn't already been seen.
>  Maintain those words in a set, so your words know how not to be seen. ("Mr.
> Nesbitt, would you please stand up?")
>
> This also assumes your paragraph consists only of words and whitespace.
>  But since you posted your previous homework-sounding question on stripping
> out non-word/whitespace characters, you'll want to look into using a regexp
> like "[\w\s]" to clean up the cruft in the paragraph.  Neither solution
> above preserves non white-space/word characters, for which I'd recommend
> using a re.sub() with a callback.  Such a callback class might look
> something like
>
> >>> class Dedupe:
> ...     def __init__(self):
> ...             self.s = set()
> ...     def __call__(self, m):
> ...             w = m.group(0)
> ...             if w in self.s: return ''
> ...             self.s.add(w)
> ...             return w
> ...
> >>> r.sub(Dedupe(), p)
>
> where I leave the definition of "r" to the student.  Also beware of
> case-differences for which you might have to normalize.
>
> You'll also want to use more descriptive variable names than my one-letter
> tokens.
>
> -tkc
>
>
>
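The for-loop approach described above can be sketched roughly as follows. This is only a minimal sketch: it assumes the paragraph consists of whitespace-separated words, that comparison is case-sensitive, and that the first occurrence of each word is the one to keep. The function name remove_repeated_words is hypothetical.

```python
def remove_repeated_words(paragraph):
    """Keep only the first occurrence of each whitespace-separated word."""
    seen = set()   # words already emitted
    kept = []      # words to keep, in their original order
    for word in paragraph.split():
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


print(remove_repeated_words("one two three four five six seven three four eight"))
```

Note that joining with a single space discards the original spacing and line breaks.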
I think a simple regex may come in handy:

  import re
  p = re.compile(r'(.+) .*\1')    # note the space after the group
  s = p.search("python and i love python")
  s.groups()
  ('python',)

But that matches only one repeated word. Perhaps someone else could shed
some light on how to extract all the repeated words, so that they can then
be removed from the original paragraph.
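One possible way to get at every repeated word, rather than only one pair, is to count the words instead of relying on a backreference, and then drop later repeats with a re.sub() callback. The following is only a sketch under some assumptions that are not from this thread: a word is taken to be a run of \w characters, case is ignored when comparing, and the helper names repeated_words and drop_later_repeats are hypothetical.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")  # assumption: a "word" is a run of \w characters


def repeated_words(paragraph):
    """Return the set of lowercased words that occur more than once."""
    counts = Counter(WORD.findall(paragraph.lower()))
    return {word for word, count in counts.items() if count > 1}


def drop_later_repeats(paragraph):
    """Replace every repeat of a word (ignoring case) with an empty string."""
    seen = set()

    def replace(match):
        key = match.group(0).lower()
        if key in seen:
            return ""            # a repeat: drop it
        seen.add(key)
        return match.group(0)    # first occurrence: keep it unchanged

    return WORD.sub(replace, paragraph)


text = "python and i love python"
print(repeated_words(text))
print(drop_later_repeats(text))
```

Punctuation and whitespace around a dropped word are left in place, so the result may need its spacing tidied up afterwards.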





-- 
Yours,
S.Selvam
Sent from Bangalore, KA, India