How to escape strings for re.finditer?

Tue Feb 28 13:16:54 EST 2023

I wrote my previous message before reading this.  Thank you for the test you ran -- it answers the question of performance.  You show that re.finditer is 30x faster, so that certainly recommends that over a simple loop, which introduces looping overhead.  


Feb 28, 2023, 05:44 by list1 at tompassin.net:

> On 2/28/2023 4:33 AM, Roel Schroeven wrote:
>
>> Op 28/02/2023 om 3:44 schreef Thomas Passin:
>>
>>> On 2/27/2023 9:16 PM, avi.e.gross at gmail.com wrote:
>>>
>>>> And, just for fun, since there is nothing wrong with your code, this minor change is terser:
>>>>
>>>>>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>>>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
>>>>>>>
>>>> ...     print(match.start(), match.end())
>>>> ...
>>>> ...
>>>> 4 18
>>>> 26 40
>>>>
>>>
>>> Just for more fun :) -
>>>
>>> Without knowing how general your expressions will be, I think the following version is very readable, certainly more readable than regexes:
>>>
>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> KEY = 'abc_degree + 1'
>>>
>>> for i in range(len(example)):
>>>     if example[i:].startswith(KEY):
>>>         print(i, i + len(KEY))
>>> # prints:
>>> 4 18
>>> 26 40
>>>
>> I think it's often a good idea to use a standard library function instead of rolling your own. The issue becomes less clear-cut when the standard library doesn't do exactly what you need (as here, where re.finditer() uses regular expressions while the use case only uses simple search strings). Ideally there would be a str.finditer() method we could use, but in the absence of that I think we still need to consider using the almost-but-not-quite fitting re.finditer().
>>
>> Two reasons:
>>
>> (1) I think it's clearer: the name tells us what it does (though of course we could solve this in a hand-written version by wrapping it in a suitably named function).
>>
>> (2) Searching for a string in another string, in a performant way, is not as simple as it first appears. Your version works correctly, but slowly. In some situations it doesn't matter, but in other cases it will. For better performance, string searching algorithms jump ahead either when they found a match or when they know for sure there isn't a match for some time (see e.g. the Boyer–Moore string-search algorithm). You could write such a more efficient algorithm, but then it becomes more complex and more error-prone. Using a well-tested existing function becomes quite attractive.
>>
>
> Sure, it all depends on what the real task will be.  That's why I wrote "Without knowing how general your expressions will be". For the example string, it's unlikely that speed will be a factor, but who knows what target strings and keys will turn up in the future?
>
>> To illustrate the difference performance, I did a simple test (using the paragraph above is test text):
>>
>>      import re
>>      import timeit
>>
>>      def using_re_finditer(key, text):
>>          matches = []
>>          for match in re.finditer(re.escape(key), text):
>>              matches.append((match.start(), match.end()))
>>          return matches
>>
>>
>>      def using_simple_loop(key, text):
>>          matches = []
>>          for i in range(len(text)):
>>              if text[i:].startswith(key):
>>                  matches.append((i, i + len(key)))
>>          return matches
>>
>>
>>      CORPUS = """Searching for a string in another string, in a performant way, is
>>      not as simple as it first appears. Your version works correctly, but slowly.
>>      In some situations it doesn't matter, but in other cases it will. For better
>>      performance, string searching algorithms jump ahead either when they found a
>>      match or when they know for sure there isn't a match for some time (see e.g.
>>      the Boyer–Moore string-search algorithm). You could write such a more
>>      efficient algorithm, but then it becomes more complex and more error-prone.
>>      Using a well-tested existing function becomes quite attractive."""
>>      KEY = 'in'
>>      print('using_simple_loop:', timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(), number=1000))
>>      print('using_re_finditer:', timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(), number=1000))
>>
>> This does 5 runs of 1000 repetitions each, and reports the time in seconds for each of those runs.
>> Result on my machine:
>>
>>      using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
>>      using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]
>>
>> We find that in this test re.finditer() is more than 30 times faster (despite the overhead of regular expressions.
>>
>> While speed isn't everything in programming, with such a large difference in performance and (to me) no real disadvantages of using re.finditer(), I would prefer re.finditer() over writing my own.
>>
>
> -- 
> https://mail.python.org/mailman/listinfo/python-list
>