How to escape strings for re.finditer?

avi.e.gross at gmail.com avi.e.gross at gmail.com
Mon Feb 27 20:56:00 EST 2023


Jen,

What you just described is why that tool is not the right tool for the job, although it may help you confirm that whatever method you choose works correctly and finds the same number of matches.

Sometimes you simply do some searching and roll your own.

Consider this code, which uses a list comprehension:

>>> short = "hello world"
>>> longer =  "hello world is how many programs start for novices but some use hello world! to show how happy they are to say hello world"

>>> short in longer
True
>>> howLong = len(short)

>>> res = [(offset, offset + howLong)  for offset  in range(len(longer)) if longer.startswith(short, offset)]
>>> res
[(0, 11), (64, 75), (111, 122)]
>>> len(res)
3

I could do a bit more but it seems to work. Did I get the offsets right? Checking:

>>> print( [ longer[res[index][0]:res[index][1]] for index in range(len(res))])
['hello world', 'hello world', 'hello world']

It seems to work, but I threw it together quickly, so it can likely be done much more nicely.

But as noted, the above has flaws, such as matching overlapping occurrences:

>>> short = "good good"
>>> longer = "A good good good but not douple plus good good good goody"
>>> howLong = len(short)
>>> res = [(offset, offset + howLong)  for offset  in range(len(longer)) if longer.startswith(short, offset)]
>>> res
[(2, 11), (7, 16), (37, 46), (42, 51), (47, 56)]

It matched five times because we sometimes had three or four occurrences of "good" in a row. Some other method might match only three.
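For comparison, here is a minimal sketch of the regex route on the same data. re.finditer reports only non-overlapping matches, and re.escape is applied to the search string only, never to the text being searched:

```python
import re

short = "good good"
longer = "A good good good but not douple plus good good good goody"

# Escape only the string being searched for; the text being searched stays as-is.
pattern = re.escape(short)

# finditer resumes scanning after the end of each match, so overlaps are skipped.
spans = [match.span() for match in re.finditer(pattern, longer)]
print(spans)  # [(2, 11), (37, 46), (47, 56)]
```

That gives three matches rather than the five found by the startswith approach.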

What some might do can get long, and you clearly want one answer and not tutorials. For example, people can write a loop that finds a match and then either sabotages that area by replacing or deleting it, or keeps track of the position and searches again in a substring that starts at an offset from the beginning.

When you do not find a tool, consider making one. You can take code (better than what I show above) and make it into a function, and now you have a tool. Even better, you can make it return whatever you want.
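As a rough sketch of such a tool, the function below uses str.find in a loop and keeps track of where to resume. The name find_spans and its overlapping parameter are made up for illustration; this is one possible shape, not a definitive implementation:

```python
def find_spans(text, target, overlapping=False):
    """Return (start, end) offsets of each occurrence of target in text."""
    if not target:
        return []  # an empty target would match everywhere; treat it as no match
    spans = []
    # Resume one character later to allow overlaps, or after the match to skip them.
    step = 1 if overlapping else len(target)
    start = text.find(target)
    while start != -1:
        spans.append((start, start + len(target)))
        start = text.find(target, start + step)
    return spans


a = "X - abc_degree + 1 + qq + abc_degree + 1"
b = "abc_degree + 1"
print(find_spans(a, b))  # [(4, 18), (26, 40)]
```

With overlapping=True it should report the same spans as the startswith comprehension above; with the default it skips past each match, much as re.finditer does.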

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of Jen Kris via Python-list
Sent: Monday, February 27, 2023 7:40 PM
To: Bob van der Poel <bobmellowood at gmail.com>
Cc: Python List <python-list at python.org>
Subject: Re: How to escape strings for re.finditer?


string.count() only tells me there are N instances of the string; it does not say where they begin and end, as does re.finditer.  

Feb 27, 2023, 16:20 by bobmellowood at gmail.com:

> Would string.count() work for you then?
>
> On Mon, Feb 27, 2023 at 5:16 PM Jen Kris via Python-list <python-list at python.org> wrote:
>
>>
>>  I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).  For example:
>>  
>>  a = "X - abc_degree + 1 + qq + abc_degree + 1"
>>  b = "abc_degree + 1"
>>  q = a.find(b)
>>  
>>  print(q)
>>  4
>>  
>>  So it correctly finds the start of the first instance, but not the second one.  The re code finds both instances.  If I knew that the substring occurred only once then the str.find would be best.
>>  
>>  I changed my re code after MRAB's comment, it now works.
>>  
>>  Thanks much.
>>  
>>  Jen
>>  
>>  
>>  Feb 27, 2023, 15:56 by cs at cskk.id.au:
>>  
>>  > On 28Feb2023 00:11, Jen Kris <jenkris at tutanota.com> wrote:
>>  >
>>  >> When matching a string against a longer string, where both strings have spaces in them, we need to escape the spaces.
>>  >>
>>  >> This works (no spaces):
>>  >>
>>  >> import re
>>  >> example = 'abcdefabcdefabcdefg'
>>  >> find_string = "abc"
>>  >> for match in re.finditer(find_string, example):
>>  >>     print(match.start(), match.end())
>>  >>
>>  >> That gives me the start and end character positions, which is what I want.
>>  >>
>>  >> However, this does not work:
>>  >>
>>  >> import re
>>  >> example = re.escape('X - cty_degrees + 1 + qq')
>>  >> find_string = re.escape('cty_degrees + 1')
>>  >> for match in re.finditer(find_string, example):
>>  >>     print(match.start(), match.end())
>>  >>
>>  >> I’ve tried several other attempts based on my research, but still no match.
>>  >>
>>  >
>>  > You need to print those strings out. You're escaping the _example_ string, which would make it:
>>  >
>>  >  X - cty_degrees \+ 1 \+ qq
>>  >
>>  > because `+` is a special character in regexps and so `re.escape` escapes it. But you don't want to mangle the string you're searching! After all, the text above does not contain the string `cty_degrees + 1`.
>>  >
>>  > My secondary question is: if you're escaping the thing you're searching _for_, then you're effectively searching for a _fixed_ string, not a pattern/regexp. So why on earth are you using regexps to do your searching?
>>  >
>>  > The `str` type has a `find(substring)` function. Just use that! It'll be faster and the code simpler!
>>  >
>>  > Cheers,
>>  > Cameron Simpson <cs at cskk.id.au>
>>  > --
>>  > https://mail.python.org/mailman/listinfo/python-list
>>  >
>>  
>>  --
>>  https://mail.python.org/mailman/listinfo/python-list
>>
>
>
> --
> **** Listen to my CD at http://www.mellowood.ca/music/cedars ****
> Bob van der Poel ** Wynndel, British Columbia, CANADA **
> EMAIL: bob at mellowood.ca
> WWW:   http://www.mellowood.ca
>

--
https://mail.python.org/mailman/listinfo/python-list


