Code improvement question

MRAB python at mrabarnett.plus.com
Tue Nov 14 18:25:10 EST 2023


On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
> I'd like to improve the code below, which works. It feels clunky to me.
> 
> I need to clean up user-uploaded files the size of which I don't know in
> advance.
> 
> After cleaning they might be as big as 1 MB but that would be super rare.
> Perhaps only for testing.
> 
> I'm extracting CAS numbers, which follow the pattern xx-xx-x up to
> xxxxxxx-xx-x, e.g. 1012300-77-4
> 
> import re
> 
> def remove_alpha(txt):
>     r"""  r'[^0-9\- ]':
> 
>     [^...]: Match any character that is not in the specified set.
>     0-9: Match any digit.
>     \: Escape character.
>     -: Match a hyphen.
>     Space: Match a space.
>     """
>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>     bits = cleaned_txt.split()
>     pieces = []
>     for bit in bits:
>         # minimum size of a CAS number is 7 so drop smaller clumps of digits
>         pieces.append(bit if len(bit) > 6 else "")
>     return " ".join(pieces)
> 
> 
> Many thanks for any hints
> 
Why don't you use re.findall?

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b', txt)
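
As a minimal sketch of how that could replace remove_alpha, assuming the shape given in the question (two to seven digits, then two digits, then a single final digit). The names CAS_PATTERN, extract_cas_numbers and sample are only illustrative and are not from the original code:

```python
import re

# Assumed shape from the question: 2-7 digits, 2 digits, 1 final digit.
# The \b anchors stop the pattern from matching inside a longer run of digits.
CAS_PATTERN = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b')

def extract_cas_numbers(txt):
    """Return every CAS-shaped token found in txt, in order of appearance."""
    return CAS_PATTERN.findall(txt)

sample = "Water is 7732-18-5, see also 1012300-77-4; ignore 12-3 and 2023-11-14."
print(" ".join(extract_cas_numbers(sample)))
```

This only checks the shape of each token, so anything that happens to look like xx-xx-x will still be returned. It does save the separate clean, split and length-check passes, since findall picks out the matches directly.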


