function to remove and punctuation

Sun Apr 10 11:52:24 EDT 2016

Thomas 'PointedEars' Lahn wrote:

> Peter Otten wrote:
> 
>> geshdus at gmail.com wrote:
>>> how to write a function taking a string parameter, which returns it
>>> after you delete the spaces, punctuation marks, accented characters in
>>> python ?
>> 
>> Looks like you want to remove more characters than you want to keep. In
>> this case I'd decide what characters to keep first, e.g. (assuming
>> Python 3)
> 
> However, with *that* approach (which is different from the OP’s request),
> regular expression matching might turn out to be more efficient:
> 
> -----------------------------------------------------------
> import re
> print("".join(re.findall(r'[a-z]+', "...", re.IGNORECASE)))
> -----------------------------------------------------------
> 
> With the OP’s original request, they may still be the better approach.
> For example:
> 
> ----------------------------------------------------------------------
> import re
> print("".join(re.sub(r'[\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "", "...",
>                      flags=re.IGNORECASE)))
> ----------------------------------------------------------------------
> 
> or
> 
> ----------------------------------------------------------------------
> import re
> print("".join(re.findall(r'[^\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "...",
>                          flags=re.IGNORECASE)))
> ----------------------------------------------------------------------
> 
>>>>> import string
>>>>> keep = string.ascii_letters + string.digits
>>>>> keep
>> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>> 
>> Now you can iterate over the characters and check if you want to preserve
>> it for each of them:
> 
> The good thing about this part of the approach you suggested is that you
> can build regular expressions from strings, too:
> 
>   keep = '[' + 'a-z' + r'\d' + ']'
>  
>>>>> def clean(s, keep):
>> ...     return "".join(c for c in s if c in keep)
>> ...
> 
> Why would one prefer this over "".join(filter(lambda c: c in keep, s))?

Because it's idiomatic Python and easy to understand if you are coming from 
the imperative loop:

buf = []
for c in s:
    if c in keep:
        buf.append(c)
"".join(buf)

Because it uses Python syntax instead of the filter/map/reduce trio.

Because it avoids the extra function call (the lambda) though the speed 
difference is not as big as I expected:

$ python3 -m timeit \
    -s 'import string; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000' \
    '"".join(filter(lambda c: c in keep, s))'
100 loops, best of 3: 4.66 msec per loop

$ python3 -m timeit \
    -s 'import string; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000' \
    '"".join(c for c in s if c in keep)'
100 loops, best of 3: 3.11 msec per loop
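
To make the comparison concrete, here is a small sketch that puts the three 
spellings side by side and checks that they agree. The function names are 
made up for illustration; only the bodies come from this thread.

```python
import string

keep = string.ascii_letters + string.digits

def clean_loop(s, keep):
    # explicit imperative version
    buf = []
    for c in s:
        if c in keep:
            buf.append(c)
    return "".join(buf)

def clean_genexp(s, keep):
    # generator expression, no extra function call per character
    return "".join(c for c in s if c in keep)

def clean_filter(s, keep):
    # filter() with a lambda, one Python-level call per character
    return "".join(filter(lambda c: c in keep, s))

sample = "<alpha> äöü ::42"
results = [f(sample, keep) for f in (clean_loop, clean_genexp, clean_filter)]
print(results)
```

All three should return the same string, so the choice is about 
readability and speed, not about behaviour.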

For reference here is a variant using regular expressions (picked at random, 
feel free to find a faster one):

$ python3 -m timeit \
    -s 'import string, re; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000; sub=re.compile(r"[^a-zA-Z0-9]+").sub' \
    'sub("", s)'
1000 loops, best of 3: 1.65 msec per loop
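
If you would rather not hard-code the character class, it can probably be 
built from the keep string instead. A minimal sketch, assuming keep is 
non-empty; make_regex_cleaner is a hypothetical helper name, not something 
from the standard library:

```python
import re
import string

def make_regex_cleaner(keep):
    # re.escape() protects characters such as "]", "^", "-" and "\"
    # that would otherwise have a special meaning inside [...]
    pattern = re.compile("[^" + re.escape(keep) + "]+")
    def clean(s):
        # delete every run of characters that are not in keep
        return pattern.sub("", s)
    return clean

clean_re = make_regex_cleaner(string.ascii_letters + string.digits)
print(clean_re("<alpha> äöü ::42"))
```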

And finally str.translate():

$ python3 -m timeit \
    -s 'import string, collections as c; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000; trans = c.defaultdict(lambda: None, str.maketrans(keep, keep))' \
    's.translate(trans)'
1000 loops, best of 3: 997 usec per loop
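
Written out as a reusable function, the str.translate() variant might look 
like the sketch below. make_translate_cleaner is a name I made up; the table 
construction is the same as in the timeit setup above.

```python
import string
from collections import defaultdict

def make_translate_cleaner(keep):
    # characters in keep map to themselves; every other code point
    # falls through to the default factory and maps to None (= delete)
    table = defaultdict(lambda: None, str.maketrans(keep, keep))
    def clean(s):
        return s.translate(table)
    return clean

clean_tr = make_translate_cleaner(string.ascii_letters + string.digits)
print(clean_tr("<alpha> äöü ::42"))
```

Building the table once and reusing it matters here, because constructing 
it on every call would eat most of the speed advantage.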

>>>>> clean("<alpha> äöü ::42", keep)
>> 'alpha42'
>>>>> clean("<alpha> äöü ::42", string.ascii_letters)
>> 'alpha'
>> 
>> If you are dealing with a lot of text you can make this a bit more
>> efficient with the str.translate() method. Create a mapping that maps all
>> characters that you want to keep to themselves
>> 
>>>>> m = str.maketrans(keep, keep)
>>>>> m[ord("a")]
>> 97
>>>>> m[ord(">")]
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> KeyError: 62
>> 
>> and all characters that you want to discard to None
> 
> Why would creating a *larger* list for *more* operations be *more*
> efficient?
> 

I don't understand the question. If you mean that the trans dict may become 
large -- that depends on the input data. The characters to be deleted are 
lazily added to the defaultdict. For text in European languages the total 
size should stay well below 256 entries. But you are probably aiming at 
something else...
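
To illustrate what I mean by "lazily added", here is a small sketch that 
looks at the size of the table before and after a single translate() call. 
The variable names are only for illustration.

```python
import string
from collections import defaultdict

keep = string.ascii_letters + string.digits
trans = defaultdict(lambda: None, str.maketrans(keep, keep))

size_before = len(trans)            # one entry per character in keep
"<alpha> äöü ::42".translate(trans)
size_after = len(trans)             # plus one entry per distinct discarded character

# the newly added keys are the code points that were looked up but not kept
added = sorted(chr(k) for k, v in trans.items() if v is None)
print(size_before, size_after, added)
```

The table only grows by the distinct characters that actually occur in the 
input, not by every possible code point.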