How to escape strings for re.finditer?

avi.e.gross at gmail.com avi.e.gross at gmail.com
Mon Feb 27 22:47:57 EST 2023


I think by now we have given the OP everything needed, but it strikes me that
Dave's answer could be a tad faster as a while loop if you are searching a
larger corpus, such as an entire ebook or all the books you can search on
books.google.com.

I think I mentioned earlier that some assumptions need to apply. The text
needs to be in something like an ASCII encoding, or be treated as code points
rather than bytes. We assume that after a match the search moves forward by
the length of the match. And, clearly, no match can start fewer than len(key)
characters from the end.

So a while loop would begin with a variable set to zero to mark the current
location of the search. The loop repeats as long as this variable is less
than or equal to len(searched_text) - len(key).

In the loop, each comparison is done the same way David does it, or in any
similar enough way, but the twist is that a failure increments the variable
by 1 while a success increments it by len(key).
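
A minimal sketch of that loop, for illustration only. The function name
find_non_overlapping is made up here, and it uses str.startswith with a start
offset rather than slicing, which is one of several ways to do the comparison:

```python
def find_non_overlapping(text, key):
    """Return (start, end) pairs for non-overlapping occurrences of key."""
    if not key:
        raise ValueError("key must not be empty")
    matches = []
    pos = 0                              # current location of the search
    last_start = len(text) - len(key)    # no match can start after this
    while pos <= last_start:
        if text.startswith(key, pos):    # compare in place, no slice copy
            matches.append((pos, pos + len(key)))
            pos += len(key)              # success: skip past the match
        else:
            pos += 1                     # failure: advance one character
    return matches


example = 'X - abc_degree + 1 + qq + abc_degree + 1'
print(find_non_overlapping(example, 'abc_degree + 1'))  # [(4, 18), (26, 40)]
print(find_non_overlapping('aaaa', 'aa'))               # [(0, 2), (2, 4)]
```

The second call shows the effect of skipping ahead: testing every index
would also report the overlapping match at (1, 3).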

Will this make much difference? It might, as the simpler algorithm also
reports overlapping matches and wastes some time testing positions inside
text it has already matched.

And, of course, if you made something like this into a search function, you
can easily add features, such as returning only the first N matches or the
next N, simply by making it a generator.

So, tying this into an earlier discussion, do you want the LAST match info
visible when the while loop has completed? If it were available, it would
open up possibilities for running the loop again, starting from where you
left off.
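
As a rough sketch of that idea, assuming a hypothetical generator named
iter_matches with an optional start argument (neither comes from the earlier
posts), itertools.islice can take the first N matches and the end of the last
match can seed a later search:

```python
from itertools import islice


def iter_matches(text, key, start=0):
    """Yield (start, end) pairs for non-overlapping matches from `start` on."""
    if not key:
        raise ValueError("key must not be empty")
    pos = start
    last_start = len(text) - len(key)
    while pos <= last_start:
        if text.startswith(key, pos):
            end = pos + len(key)
            yield pos, end
            pos = end        # continue after the match just reported
        else:
            pos += 1


text = 'ab-ab-ab-ab'

# Only the first two matches; the generator does no work beyond them.
first_two = list(islice(iter_matches(text, 'ab'), 2))
print(first_two)             # [(0, 2), (3, 5)]

# Resume later from where the last match ended.
resume_at = first_two[-1][1]
rest = list(iter_matches(text, 'ab', start=resume_at))
print(rest)                  # [(6, 8), (9, 11)]
```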



-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On
Behalf Of Thomas Passin
Sent: Monday, February 27, 2023 9:44 PM
To: python-list at python.org
Subject: Re: How to escape strings for re.finditer?

On 2/27/2023 9:16 PM, avi.e.gross at gmail.com wrote:
> And, just for fun, since there is nothing wrong with your code, this minor
> change is terser:
>
> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> >>> for match in re.finditer(re.escape('abc_degree + 1'), example):
> ...     print(match.start(), match.end())
> ...
> ...
> 4 18
> 26 40

Just for more fun :) -

Without knowing how general your expressions will be, I think the following
version is very readable, certainly more readable than regexes:

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'

for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
# 4 18
# 26 40

If you may have variable numbers of spaces around the symbols, OTOH, the
whole situation changes and then regexes would almost certainly be the best
approach.  But the regular expression strings would become harder to read.
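
A rough sketch of what that regex version might look like, under the
assumption that spacing may vary only at the places where the key itself
contains whitespace. The helper name flexible_pattern is hypothetical:

```python
import re


def flexible_pattern(key):
    """Compile a regex matching key with any amount of whitespace
    (including none) wherever key itself has whitespace."""
    tokens = key.split()
    # Escape each token so characters like '+' are matched literally.
    return re.compile(r'\s*'.join(re.escape(tok) for tok in tokens))


pattern = flexible_pattern('abc_degree + 1')
example = 'X - abc_degree+1 + qq + abc_degree  +  1'
spans = [m.span() for m in pattern.finditer(example)]
print(spans)  # [(4, 16), (24, 40)]
```

Splitting the key into tokens keeps the escaping readable, at the cost of
not handling a key such as 'abc_degree+1' that has no spaces to split on.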
--
https://mail.python.org/mailman/listinfo/python-list


