How to escape strings for re.finditer?

Tue Feb 28 15:40:26 EST 2023

This message is more for Thomas than Jen,

You made me think of what happens in fairly large cases. What happens if I ask you to search a thousand pages looking for your name? 

One solution might be to break the problem into parts that can be run in independent threads or processes, perhaps across different CPUs or on many machines at once. Think of it as a variant on a merge sort, where each chunk returns the places where it found one or more items, and those results are then gathered together and merged upstream.

The problem is that you cannot just divide the text at arbitrary points, because any match that straddles a divide is lost. So if you know you are searching for "Thomas Passin", each chunk needs to overlap its neighbor by enough characters to hold a match of that size (one character less than the length of the search string is enough). The result would not be a pure binary tree, and if what might match can vary in size, the overlaps would produce duplicates. So the merging step would eventually have to remove those.
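
Here is a minimal sketch of that idea, assuming a fixed literal search string. The names search_chunk and find_all_chunked, the chunk size, and the worker count are all made up for illustration. Each chunk is extended by len(key) - 1 characters so a match that crosses a divide is still seen whole, and the merge step uses a set to drop any duplicates. Threads are used only to show the shape; for CPU-bound searching in CPython you would more likely want processes or separate machines.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def search_chunk(text, key, start, chunk_size):
    """Return absolute (start, end) spans of key found in one chunk."""
    # Extend the chunk by len(key) - 1 so a match that starts near the end
    # of this chunk and spills into the next one is still seen whole.
    stop = start + chunk_size + len(key) - 1
    piece = text[start:stop]
    return [(start + m.start(), start + m.end())
            for m in re.finditer(re.escape(key), piece)]


def find_all_chunked(text, key, chunk_size=16, workers=4):
    """Search each chunk independently, then merge the results upstream."""
    starts = range(0, len(text), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda s: search_chunk(text, key, s, chunk_size), starts)
        # Merge step: the set removes duplicates, sorted() restores order.
        return sorted({span for part in parts for span in part})


example = 'X - abc_degree + 1 + qq + abc_degree + 1'
print(find_all_chunked(example, 'abc_degree + 1'))
# prints: [(4, 18), (26, 40)]
```

One caveat: if the search string can overlap itself (say 'aa' in 'aaaa'), the chunked version can report overlapping matches that a single re.finditer pass over the whole text would skip.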

I have often wondered how Google and other such services are able to find millions of things in hardly any time, and arguably never show most of them, since who looks past the first few pages or screens?

I think much of that may involve other techniques, including quite a bit of pre-indexing. But they also seem to enlist lots of processors that each do the search on a subset of the problem space, and then combine and prioritize the results.
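
I have no idea what any real search service does internally, but a toy version of pre-indexing might look like the sketch below. The names build_index and pages_containing are hypothetical. The index maps each word to the set of pages it appears on, so a query only has to intersect a few sets instead of scanning every page.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of page numbers containing it."""
    index = defaultdict(set)
    for page_no, text in enumerate(pages):
        for word in re.findall(r'\w+', text.lower()):
            index[word].add(page_no)
    return index


def pages_containing(index, phrase):
    """Return the pages that contain every word of the phrase."""
    words = re.findall(r'\w+', phrase.lower())
    if not words:
        return set()
    # Intersect the per-word page sets; a word never seen gives an empty set.
    return set.intersection(*(index.get(w, set()) for w in words))


pages = [
    'Thomas Passin wrote a reply',
    'Jen Kris asked about regex',
    'Passin and Thomas are both here',
]
index = build_index(pages)
print(sorted(pages_containing(index, 'Thomas Passin')))
# prints: [0, 2]
```

Note that page 2 has both words but not the exact phrase, so the index only narrows things down to candidate pages. You would still run the real search, such as re.finditer with re.escape, on just those candidates.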

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of Thomas Passin
Sent: Tuesday, February 28, 2023 1:31 PM
To: python-list at python.org
Subject: Re: How to escape strings for re.finditer?

On 2/28/2023 1:07 PM, Jen Kris wrote:
> 
> Using str.startswith is a cool idea in this case.  But is it better 
> than regex for performance or reliability?  Regex syntax is not a 
> model of simplicity, but in my simple case it's not too difficult.

The trouble is that we don't know what your case really is.  If you are talking about a short pattern like your example and a small text to search, and you don't need to do it too often, then my little code example is probably ideal. Reliability wouldn't be an issue, and performance would not be relevant.  If your case is going to be much larger, called many times in a loop, or be much more complicated in some other way, then a regex or some other approach is likely to be much faster.

> Feb 27, 2023, 18:52 by list1 at tompassin.net:
> 
>     On 2/27/2023 9:16 PM, avi.e.gross at gmail.com wrote:
> 
>         And, just for fun, since there is nothing wrong with your code,
>         this minor change is terser:
> 
>         import re
>         example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>         for match in re.finditer(re.escape('abc_degree + 1'), example):
>             print(match.start(), match.end())
>         # prints:
>         4 18
>         26 40
> 
> 
>     Just for more fun :) -
> 
>     Without knowing how general your expressions will be, I think the
>     following version is very readable, certainly more readable than
>     regexes:
> 
>     example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>     KEY = 'abc_degree + 1'
> 
>     for i in range(len(example)):
>         if example[i:].startswith(KEY):
>             print(i, i + len(KEY))
>     # prints:
>     4 18
>     26 40
> 
>     If you may have variable numbers of spaces around the symbols, OTOH,
>     the whole situation changes and then regexes would almost certainly
>     be the best approach. But the regular expression strings would
>     become harder to read.
>     -- 
>     https://mail.python.org/mailman/listinfo/python-list
> 
> 
