Code improvement question

Mike Dewhirst miked at dewhirst.com.au
Thu Nov 16 20:15:43 EST 2023


On 15/11/2023 3:08 pm, MRAB via Python-list wrote:
> On 2023-11-15 03:41, Mike Dewhirst via Python-list wrote:
>> On 15/11/2023 10:25 am, MRAB via Python-list wrote:
>>> On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
>>>> I'd like to improve the code below, which works. It feels clunky to me.
>>>>
>>>> I need to clean up user-uploaded files whose size I don't know in
>>>> advance.
>>>>
>>>> After cleaning they might be as big as 1 MB, but that would be super
>>>> rare. Perhaps only for testing.
>>>>
>>>> I'm extracting CAS numbers, and the pattern runs from xx-xx-x up to
>>>> xxxxxxx-xx-x, e.g. 1012300-77-4
>>>>
>>>> import re
>>>>
>>>> def remove_alpha(txt):
>>>>     r"""  r'[^0-9\- ]':
>>>>
>>>>     [^...]: Match any character that is not in the specified set.
>>>>     0-9: Match any digit.
>>>>     \: Escape character.
>>>>     -: Match a hyphen.
>>>>     Space: Match a space.
>>>>     """
>>>>     # strip everything except digits, hyphens and spaces
>>>>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>>>>     bits = cleaned_txt.split()
>>>>     pieces = []
>>>>     for bit in bits:
>>>>         # minimum size of a CAS number is 7 so drop smaller clumps of digits
>>>>         pieces.append(bit if len(bit) > 6 else "")
>>>>     return " ".join(pieces)
>>>>
>>>>
>>>> Many thanks for any hints
>>>>
>>> Why don't you use re.findall?
>>>
>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>>
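For anyone reading this thread later, here is a minimal sketch of how the re.findall suggestion could stand in for the whole function. Two things here are assumptions rather than anything posted above: the name extract_cas is made up for illustration, and the last group is written as a single digit to follow the xx-xx-x format given at the top of the thread (the pattern as posted ends in [0-9]{2}, which would not match the 1012300-77-4 example).

```python
import re

# Assumption: the final group is one digit, per the xx-xx-x format
# described earlier in the thread.
CAS_RE = re.compile(r"\b[0-9]{2,7}-[0-9]{2}-[0-9]\b")


def extract_cas(txt):
    """Return the CAS-like numbers found in txt, separated by single spaces.

    extract_cas is a hypothetical name used only for this sketch.
    """
    return " ".join(CAS_RE.findall(txt))


sample = "Acetone (CAS 67-64-1), batch 12345, and 1012300-77-4."
print(extract_cas(sample))
```

Unlike the strip-then-split approach, this never has to delete the letters first, and bare clumps of digits such as "12345" are simply never matched.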
>> I think I can see what you did there but it won't make sense to me - or
>> whoever looks at the code - in future.
>>
>> That answers your specific question. However, I am in awe of people who
>> can just "do" regular expressions and I thank you very much for what
>> would have been a monumental effort had I tried it.
>>
>> That little re.sub() came from ChatGPT and I can understand it without
>> too much effort because it came documented.
>>
>> I suppose ChatGPT is the answer to this thread. Or everything. Or will be.
>>
> \b          Word boundary
> [0-9]{2,7}  2..7 digits
> -           "-"
> [0-9]{2}    2 digits
> -           "-"
> [0-9]{2}    2 digits
> \b          Word boundary
>
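The breakdown above can also live inside the pattern itself, which may help with the worry about the regex not making sense to a future reader. This is only a sketch using re.VERBOSE, which ignores whitespace in the pattern and treats "#" as the start of a comment. The name CAS_PATTERN is an assumption, and so is the single final digit, which follows the xx-xx-x format from the original question rather than the [0-9]{2} in the pattern as posted.

```python
import re

# Same pieces as the breakdown, written with re.VERBOSE so each part
# carries its own comment.
CAS_PATTERN = re.compile(
    r"""
    \b          # word boundary
    [0-9]{2,7}  # 2..7 digits
    -           # "-"
    [0-9]{2}    # 2 digits
    -           # "-"
    [0-9]       # 1 digit (assumed, per the xx-xx-x format)
    \b          # word boundary
    """,
    re.VERBOSE,
)

print(CAS_PATTERN.findall("CAS 1012300-77-4 and 67-64-1"))
```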
> The "word boundary" thing is to stop it matching where there are 
> letters or digits right next to the digits.
>
> For example, if the text contained, say, "123456789-12-1234", you 
> wouldn't want it to match because there are more than 7 digits at the 
> start and more than 2 digits at the end.
>
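A quick way to see what the word boundaries buy you is to run the same pattern with and without them on that example string. This is a sketch only: the names bounded and unbounded are made up, and the final group is assumed to be a single digit as in the xx-xx-x format from the original question.

```python
import re

# Assumption: one final digit, per the xx-xx-x format in the question.
bounded = re.compile(r"\b[0-9]{2,7}-[0-9]{2}-[0-9]\b")
unbounded = re.compile(r"[0-9]{2,7}-[0-9]{2}-[0-9]")

text = "123456789-12-1234"

# With \b the over-long run of digits is rejected outright.
print(bounded.findall(text))    # []

# Without \b the pattern matches a slice from the middle of the string.
print(unbounded.findall(text))  # ['3456789-12-1']
```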
Thanks

I know I should invest some brainspace in re. Many years ago at a Perl
conference I did buy a coffee mug completely covered with a regex cheat
sheet. It currently holds pens and pencils on my desk. And spiders, now I
look closely!

Then I took up Python and re is different.

Maybe I'll have another look ...

Cheers

Mike

-- 
Signed email is an absolute defence against phishing. This email has
been signed with my private key. If you import my public key you can
automatically decrypt my signature and be sure it came from me. Your
email software can handle signing.
