Code improvement question

Thu Nov 16 20:22:46 EST 2023

On 2023-11-17 01:15, Mike Dewhirst via Python-list wrote:
> On 15/11/2023 3:08 pm, MRAB via Python-list wrote:
>> On 2023-11-15 03:41, Mike Dewhirst via Python-list wrote:
>>> On 15/11/2023 10:25 am, MRAB via Python-list wrote:
>>>> On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
>>>>> I'd like to improve the code below, which works. It feels clunky to me.
>>>>>
>>>>> I need to clean up user-uploaded files the size of which I don't
>>>>> know in advance.
>>>>>
>>>>> After cleaning they might be as big as 1Mb but that would be super
>>>>> rare. Perhaps only for testing.
>>>>>
>>>>> I'm extracting CAS numbers and here is the pattern: xx-xx-x up to
>>>>> xxxxxxx-xx-x, e.g., 1012300-77-4
>>>>>
>>>>> import re
>>>>>
>>>>> def remove_alpha(txt):
>>>>>     r"""  r'[^0-9\- ]':
>>>>>     [^...]: Match any character that is not in the specified set.
>>>>>     0-9: Match any digit.
>>>>>     \: Escape character.
>>>>>     -: Match a hyphen.
>>>>>     Space: Match a space.
>>>>>     """
>>>>>     cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
>>>>>     bits = cleaned_txt.split()
>>>>>     pieces = []
>>>>>     for bit in bits:
>>>>>         # minimum size of a CAS number is 7 so drop smaller clumps of digits
>>>>>         pieces.append(bit if len(bit) > 6 else "")
>>>>>     return " ".join(pieces)
>>>>>
>>>>>
>>>>> Many thanks for any hints
>>>>>
>>>> Why don't you use re.findall?
>>>>
>>>> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
>>>
>>> I think I can see what you did there but it won't make sense to me - or
>>> whoever looks at the code - in future.
>>>
>>> That answers your specific question. However, I am in awe of people who
>>> can just "do" regular expressions and I thank you very much for what
>>> would have been a monumental effort had I tried it.
>>>
>>> That little re.sub() came from ChatGPT and I can understand it without
>>> too much effort because it came documented.
>>>
>>> I suppose ChatGPT is the answer to this thread. Or everything. Or will be.
>>>
>> \b          Word boundary
>> [0-9]{2,7}  2..7 digits
>> -           "-"
>> [0-9]{2}    2 digits
>> -           "-"
>> [0-9]{2}    2 digits
>> \b          Word boundary
>>
>> The "word boundary" thing is to stop it matching where there are 
>> letters or digits right next to the digits.
>>
>> For example, if the text contained, say, "123456789-12-1234", you 
>> wouldn't want it to match because there are more than 7 digits at the 
>> start and more than 2 digits at the end.
>>
> Thanks
> 
> I know I should invest some brainspace in re. Many years ago at a Perl
> conference I did buy a coffee mug completely covered with a regex cheat
> sheet. It currently holds pens and pencils on my desk. And spiders, now I
> look closely!
> 
> Then I took up Python and re is different.
> 
> Maybe I'll have another look ...
> 
The patterns themselves aren't that different; Perl's regexes just have
more features than the re module's.
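
For anyone reading this later, here is a minimal sketch of the re.findall approach discussed above, written with re.VERBOSE so the pattern carries its own documentation. The names CAS_RE and find_cas_numbers are illustrative only and do not come from the thread. One assumption to flag: the pattern quoted earlier ends in [0-9]{2}, whereas the CAS format described in the original question (xx-xx-x) ends in a single digit, so the sketch uses [0-9] for the last group. Adjust it if your data differs.

```python
import re

# Assumption: a CAS number is 2-7 digits, then 2 digits, then 1 digit,
# per the xx-xx-x .. xxxxxxx-xx-x format given in the original question.
CAS_RE = re.compile(
    r"""
    \b            # word boundary: no letter or digit directly before
    [0-9]{2,7}    # 2 to 7 digits
    -             # a hyphen
    [0-9]{2}      # 2 digits
    -             # a hyphen
    [0-9]         # 1 digit (assumed from the xx-xx-x format)
    \b            # word boundary: no letter or digit directly after
    """,
    re.VERBOSE,
)


def find_cas_numbers(txt):
    """Return every CAS-number-like string found in txt, in order."""
    return CAS_RE.findall(txt)


if __name__ == "__main__":
    sample = "Water 7732-18-5 and 1012300-77-4, not 123456789-12-1 or 12-3"
    print(find_cas_numbers(sample))
```

Unlike remove_alpha, this returns a list of matches rather than a cleaned string, so " ".join(find_cas_numbers(txt)) would be needed to get a single space-separated string back. It only checks the shape of the number, not whether the final check digit is valid.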