How to escape strings for re.finditer?

avi.e.gross at gmail.com avi.e.gross at gmail.com
Mon Feb 27 19:34:49 EST 2023


Just FYI, Jen, there are times a sledgehammer works but perhaps is not the only way. These days people worry less about efficiency and more about programmer time and education and that can be fine.

But if you look at the methods available for strings or in some other modules, you will see that your situation is quite common. Some may use another RE front end called finditer().

I am NOT suggesting you do what I say next, but imagine writing a loop that, at each offset, takes a substring of the text you are searching, of the same length as your search string. Near the end, it stops as there is too little left.

You can now simply test your search string against that substring for equality, and the comparison tends to return rapidly when they differ early on.

Your loop would return whatever data structure or results you want such as that it matched it three times at offsets a, b and c.

But do you allow overlaps? If not, your loop needs to skip len(search_str) after a match.
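
To make the loop described above concrete, here is a minimal sketch. The function name find_all and the allow_overlaps parameter are invented for illustration; nothing like them exists in the standard library, and this is not a recommendation over the built-in tools.

```python
def find_all(text, target, allow_overlaps=False):
    """Return the start offsets of every occurrence of target in text."""
    if not target:
        return []  # an empty search string would "match" everywhere
    offsets = []
    width = len(target)
    i = 0
    # Stop once there is too little text left to hold a full match.
    while i <= len(text) - width:
        if text[i:i + width] == target:
            offsets.append(i)
            # Without overlaps, skip past the whole match.
            i += 1 if allow_overlaps else width
        else:
            i += 1
    return offsets


print(find_all("X - abc_degree + 1 + qq + abc_degree + 1", "abc_degree + 1"))  # [4, 26]
print(find_all("aaaa", "aa"))                       # [0, 2]
print(find_all("aaaa", "aa", allow_overlaps=True))  # [0, 1, 2]
```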

What you may want to consider is another form of pre-processing. Do you care if "abc_degree + 1" has missing or added spaces at the start or end or anywhere in the middle, as in " abc_degree +1"?

Do you care if stuff is a different case like "Abc_Degree + 1"?

Some such searches can be done if both the pattern and the searched string are first converted to a canonical format, so that variants map to the same output. But that complicates things a bit, and you may need to display what you match differently.
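
One possible canonical form, as a rough sketch only: fold case and drop all whitespace. The helper name canonical is made up for this example. Note that offsets found in the canonical text no longer line up with the original text, and that dropping every space is aggressive (it would also make "abc _degree" compare equal to "abc_degree").

```python
def canonical(s):
    """Fold case and remove all whitespace so spacing/case variants compare equal."""
    return "".join(s.casefold().split())


pattern = "abc_degree + 1"
for candidate in (" abc_degree +1", "Abc_Degree + 1", "X - ABC_degree+1 + qq"):
    # True when the canonical pattern occurs in the canonical candidate.
    print(repr(candidate), canonical(pattern) in canonical(candidate))
```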

And are you also willing to match this: "myabc_degree + 1"?

When using a crafted RE there is a way to ask for a word boundary (\b), so abc will only be matched if what comes before it is a non-word character such as a space, or the start of the string, and not "my".
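
A small sketch of the word-boundary idea, assuming the search text is a fixed string that gets escaped and then prefixed with \b. The sample text is invented for the example.

```python
import re

target = "abc_degree + 1"
# \b in front of the escaped target demands a word boundary at that point,
# so a preceding letter, digit or underscore prevents the match.
pattern = re.compile(r"\b" + re.escape(target))

text = "myabc_degree + 1 + abc_degree + 1"
print([m.span() for m in pattern.finditer(text)])  # [(19, 33)]
```

The occurrence inside "myabc_degree" is skipped because "y" and "a" are both word characters, so there is no boundary between them. Any non-word character, not only a space, counts as a boundary, so "(abc_degree + 1" would still match.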

So this may be a case where you can either solve an easy version, with the chance it can be fooled, or overengineer it. If you are allowing the user to type in what to search for, as many programs, including editors, do, you will often find such false positives unless the user knows RE syntax and applies it and you do not escape it. I have experienced havoc when doing a careless global replace that matched more than I expected, including making changes in comments or constant strings rather than just the name of a function. Adding a paren is helpful, as is replacing one match at a time rather than all at once and skipping any that are not wanted.

Good luck.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of Jen Kris via Python-list
Sent: Monday, February 27, 2023 7:14 PM
To: Cameron Simpson <cs at cskk.id.au>
Cc: Python List <python-list at python.org>
Subject: Re: How to escape strings for re.finditer?


I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).  For example:  

a = "X - abc_degree + 1 + qq + abc_degree + 1"
b = "abc_degree + 1"
q = a.find(b)

print(q)
4

So it correctly finds the start of the first instance, but not the second one.  The re code finds both instances.  If I knew that the substring occurred only once then the str.find would be best.  
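
For illustration only (this is a sketch, not necessarily the code that was actually used in the thread), both approaches can locate every occurrence in the example above. The names spans, starts and pos are invented here.

```python
import re

a = "X - abc_degree + 1 + qq + abc_degree + 1"
b = "abc_degree + 1"

# Escape only the string being searched for, never the text being searched.
spans = [(m.start(), m.end()) for m in re.finditer(re.escape(b), a)]
print(spans)  # [(4, 18), (26, 40)]

# The same start offsets with str.find, resuming after each hit.
starts = []
pos = a.find(b)
while pos != -1:
    starts.append(pos)
    pos = a.find(b, pos + len(b))
print(starts)  # [4, 26]
```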

I changed my re code after MRAB's comment, and it now works.  

Thanks much.  

Jen


Feb 27, 2023, 15:56 by cs at cskk.id.au:

> On 28Feb2023 00:11, Jen Kris <jenkris at tutanota.com> wrote:
>
>> When matching a string against a longer string, where both strings 
>> have spaces in them, we need to escape the spaces.
>>
>> This works (no spaces):
>>
>> import re
>> example = 'abcdefabcdefabcdefg'
>> find_string = "abc"
>> for match in re.finditer(find_string, example):
>>     print(match.start(), match.end())
>>
>> That gives me the start and end character positions, which is what I 
>> want.
>>
>> However, this does not work:
>>
>> import re
>> example = re.escape('X - cty_degrees + 1 + qq')
>> find_string = re.escape('cty_degrees + 1')
>> for match in re.finditer(find_string, example):
>>     print(match.start(), match.end())
>>
>> I’ve tried several other attempts based on my research, but still no 
>> match.
>>
>
> You need to print those strings out. You're escaping the _example_ string, which would make it:
>
>  X\ \-\ cty_degrees\ \+\ 1\ \+\ qq
>
> because `+` is a special character in regexps and so `re.escape` escapes it, along with `-` and the spaces. But you don't want to mangle the string you're searching! After all, the text above does not contain the string `cty_degrees + 1`.
>
> My secondary question is: if you're escaping the thing you're searching _for_, then you're effectively searching for a _fixed_ string, not a pattern/regexp. So why on earth are you using regexps to do your searching?
>
> The `str` type has a `find(substring)` function. Just use that! It'll be faster and the code simpler!
>
> Cheers,
> Cameron Simpson <cs at cskk.id.au>
> --
> https://mail.python.org/mailman/listinfo/python-list
>
