function to remove and punctuation

Thomas 'PointedEars' Lahn PointedEars at web.de
Sun Apr 10 10:23:12 EDT 2016


Peter Otten wrote:

> geshdus at gmail.com wrote:
>> how to write a function taking a string parameter, which returns it after
>> you delete the spaces, punctuation marks, accented characters in python ?
> 
> Looks like you want to remove more characters than you want to keep. In
> this case I'd decide what characters to keep first, e.g. (assuming
> Python 3)

However, with *that* approach (which is different from the OP’s request), 
regular expression matching might turn out to be more efficient:

-----------------------------------------------------------
import re
print("".join(re.findall(r'[a-z]+', "...", re.IGNORECASE)))
-----------------------------------------------------------
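
Wrapped in a function, as the OP asked for, that could look like the following minimal sketch. The function name is mine and only illustrative; the test string is the one from the quoted session further down.

```python
import re

def keep_ascii_letters(s):
    # Find every run of letters a-z (case-insensitively) and join the runs.
    return "".join(re.findall(r'[a-z]+', s, re.IGNORECASE))

print(keep_ascii_letters("<alpha> äöü ::42"))  # alpha
```

Note that for str patterns in Python 3, [a-z] combined with re.IGNORECASE also matches a handful of non-ASCII letters that case-fold to ASCII ones (such as 'ſ'); adding re.ASCII to the flags avoids that if it matters.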

For the OP’s original request, regular expressions may still be the better
approach.
For example:

----------------------------------------------------------------------
import re
# re.sub() already returns a string, so no "".join() is needed.
print(re.sub(r'[\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "", "...",
             flags=re.IGNORECASE))
----------------------------------------------------------------------

or

----------------------------------------------------------------------
import re
# re.findall() takes (pattern, string, flags); there is no replacement argument.
print("".join(re.findall(r'[^\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "...",
                         flags=re.IGNORECASE)))
----------------------------------------------------------------------
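
Put into functions, the two variants could be sketched as follows. The function names and the sample string are my assumptions, and the character class is only the illustrative selection from above, not a complete list of punctuation marks or accented characters.

```python
import re

# Assumption: an illustrative, incomplete set of characters to delete.
UNWANTED = r'\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ'

def strip_with_sub(s):
    # Replace every run of unwanted characters with the empty string.
    return re.sub('[' + UNWANTED + ']+', "", s, flags=re.IGNORECASE)

def strip_with_findall(s):
    # Collect every run of characters that are NOT unwanted, then join them.
    return "".join(re.findall('[^' + UNWANTED + ']+', s,
                              flags=re.IGNORECASE))

sample = "Hello, World! É é"
print(strip_with_sub(sample))      # HelloWorld
print(strip_with_findall(sample))  # HelloWorld
```

Because of re.IGNORECASE, listing the uppercase accented letters is enough to remove their lowercase counterparts as well.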

> >>> import string
> >>> keep = string.ascii_letters + string.digits
> >>> keep
> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
> 
> Now you can iterate over the characters and check if you want to preserve
> it for each of them:

The good thing about this part of the approach you suggested is that you can 
build regular expressions from strings, too:

  keep = '[' + 'a-z' + r'\d' + ']'
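
A minimal sketch of that idea, building the character class from the same string as in the quoted session. The use of re.escape() and the name clean_re are my additions; escaping is not needed for letters and digits, but it keeps the class valid should the string ever contain characters like ']' or '^'.

```python
import re
import string

# Build a character class from a plain string of characters to keep.
keep_chars = string.ascii_letters + string.digits
keep = '[' + re.escape(keep_chars) + ']'

def clean_re(s, keep):
    # Hypothetical helper: find runs of kept characters and join them.
    return "".join(re.findall(keep + '+', s))

print(clean_re("<alpha> äöü ::42", keep))  # alpha42
```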
 
> >>> def clean(s, keep):
> ...     return "".join(c for c in s if c in keep)
> ...

Why would one prefer this over "".join(filter(lambda c: c in keep, s))?
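
For comparison, a short sketch of both spellings side by side. The function names are mine; the behaviour should be the same, so the choice is largely one of readability.

```python
import string

keep = string.ascii_letters + string.digits

def clean_genexp(s, keep):
    # Generator expression, as in the quoted suggestion.
    return "".join(c for c in s if c in keep)

def clean_filter(s, keep):
    # filter() with a one-argument lambda; filter is a built-in function,
    # not a str method, so its result still has to be joined.
    return "".join(filter(lambda c: c in keep, s))

print(clean_genexp("<alpha> äöü ::42", keep))  # alpha42
print(clean_filter("<alpha> äöü ::42", keep))  # alpha42
```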

> >>> clean("<alpha> äöü ::42", keep)
> 'alpha42'
> >>> clean("<alpha> äöü ::42", string.ascii_letters)
> 'alpha'
> 
> If you are dealing with a lot of text you can make this a bit more
> efficient with the str.translate() method. Create a mapping that maps all
> characters that you want to keep to themselves
> 
> >>> m = str.maketrans(keep, keep)
> >>> m[ord("a")]
> 97
> >>> m[ord(">")]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> KeyError: 62
> 
> and all characters that you want to discard to None

Why would creating a *larger* list for *more* operations be *more* 
efficient? 
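
The quoted text breaks off before it shows how the discarded characters are mapped to None, so the following is only one possible way to do it, not necessarily what was meant. It assumes a hypothetical dict subclass (KeepOnly, my name) whose __missing__ method returns None for every code point that is not in the table, so that the mapping itself does not have to grow.

```python
import string

keep = string.ascii_letters + string.digits

class KeepOnly(dict):
    """Hypothetical helper: any code point not in the table maps to None."""
    def __missing__(self, key):
        # str.translate() deletes characters whose table entry is None.
        return None

# Map every kept character to itself; everything else falls through
# to __missing__ and is therefore deleted.
m = KeepOnly(str.maketrans(keep, keep))

def clean_translate(s):
    return s.translate(m)

print(clean_translate("<alpha> äöü ::42"))  # alpha42
```

Whether this is actually faster than the other variants would have to be measured, for example with the timeit module, on the text in question.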

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


