How to escape strings for re.finditer?

avi.e.gross at gmail.com avi.e.gross at gmail.com
Tue Feb 28 15:25:05 EST 2023


Jen,

 

I had no doubt the code you ran was indented properly or it would not work.

 

I am merely letting you know that somewhere in the process of copying the code or the transition between mailers, my version is messed up. It happens to be easy for me to fix but I sometimes see garbled code I then simply ignore.

 

At times it may help to leave blank lines between statements: Python ignores them, and they keep the mailer's line rearrangements minimal.

 

On to your real question.

 

In my OPINION, there are many interesting questions that can get in the way of just getting a working solution. Some may be better in some abstract way but except for big projects it often hardly matters.

 

So regex is one thing or more a cluster of things and a list comp is something completely different. They are both tools you can use and abuse or lose.

 

The distinction I believe we started with was how to find a fixed string inside another fixed string in as many places as needed and perhaps return offset info. So this can be solved in too many ways using a side of python focused on pure text. As discussed, solutions can include explicit loops such as “for” and “while” and their syntactic sugar cousin of a list comp. Not mentioned yet are other techniques like a recursive function that finds the first and passes on the rest of the string to itself to find the rest, or various functional programming techniques that may do sort of hidden loops. YOU DO NOT NEED ALL OF THEM but it can be interesting to learn.
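
As a rough sketch of two of the text-only approaches mentioned above, here are a list comprehension and a recursive function. The function names are made up for illustration and do not come from any library. Note one difference: the list comprehension reports overlapping hits, while the recursive version skips past each hit before looking again.

```python
def find_offsets_listcomp(text, target):
    """Return (start, end) for every index where target begins (overlaps included)."""
    n = len(target)
    return [(i, i + n) for i in range(len(text) - n + 1) if text.startswith(target, i)]


def find_offsets_recursive(text, target, base=0):
    """Find the first hit, then hand the rest of the string to ourselves."""
    if not target:
        raise ValueError("target must be non-empty")  # otherwise we would recurse forever
    found = text.find(target)
    if found < 0:
        return []
    end = found + len(target)
    # base tracks how much of the original string has already been consumed.
    return [(base + found, base + end)] + find_offsets_recursive(text[end:], target, base + end)


example = 'X - abc_degree + 1 + qq + abc_degree + 1'
print(find_offsets_listcomp(example, 'abc_degree + 1'))   # [(4, 18), (26, 40)]
print(find_offsets_recursive(example, 'abc_degree + 1'))  # [(4, 18), (26, 40)]
```

The recursive version is mostly a curiosity in Python, since each hit adds a stack frame and a string copy.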

 

Regex is a completely different universe that is a bit more of MORE. If I ask you for a ride to the grocery store, I might expect you to show up with a car and not a James Bond vehicle that also is a boat, submarine, airplane, and maybe spaceship. Well, Regex is the latter. And in your case, it is this complexity that meant you had to convert your text so it will not see what it considers commands or hints.
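
To make the "commands or hints" point concrete, here is a small sketch using the same example string from this thread. Without escaping, the " +" in the search text is read as "one or more spaces" rather than a literal plus sign, so nothing matches.

```python
import re

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
target = 'abc_degree + 1'

# Unescaped: " +" is a quantifier on the space, so the literal "+" is never matched.
unescaped_hits = re.findall(target, example)
print(unescaped_hits)  # []

# Escaped: every character is taken literally.
escaped_hits = [m.span() for m in re.finditer(re.escape(target), example)]
print(escaped_hits)    # [(4, 18), (26, 40)]
```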

 

In normal use, put a bit too simply, it wants a carefully crafted pattern to be spelled out, and from that it weaves an often complex algorithm, which it then sort of compiles, that represents its understanding of what you asked for. The simplest pattern is to match EXACTLY THIS. That is your case.

 

A more complex pattern may say to match Boston OR Chicago followed by any amount of whitespace then a number of digits between 3 and 5 and then should not be followed by something specific. Oh, and by the way, save selected parts in parentheses to be accessed as \1 or \2 so I can ask you to do things like match a word followed by itself. It goes on and on. 
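
As an illustration only, the city-and-digits pattern described above might be spelled like this. The "something specific" is an assumption made for the sketch: here the digits must not be followed by another digit or a hyphen.

```python
import re

# Assumption: reject numbers followed by a digit or a hyphen, so a
# ZIP+4 style number such as 60601-1234 is not matched at all.
pattern = re.compile(r'(Boston|Chicago)\s*(\d{3,5})(?![\d-])')

text = 'Boston 02134, Chicago   60601-1234, Denver 80202, Chicago 606'

for m in pattern.finditer(text):
    # group(1) and group(2) are the parts saved by the parentheses.
    print(m.group(1), m.group(2), m.span())
```

Including \d in the look-ahead matters: with only the hyphen excluded, the engine would quietly back off to a shorter run of digits and still report a match for the second Chicago.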

 

Be warned RE is implemented now all over the place including outside the usual UNIX roots and there are somewhat different versions. For your need, it does not matter.

 

The compiled monstrosity, though, can be fairly fast, and it might be a tad hard for you to write the equivalent yourself as a bunch of nested if statements that weirdly match various patterns with some look-ahead or look-behind.

 

What you are being told is that despite this being way more than you asked for, it not only works but is fairly fast when doing the simple thing you asked for. That may be why a text version you are looking for is hard to find.

 

I am not clear what exactly the rest of your project is about but my guess is your first priority is completing it decently and not to try umpteen methods and compare them. Not today. Of course if the working version is slow and you profile it and find this part seems to be holding it back, it may be worth examining.
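
If that day ever comes, a minimal timing sketch might look like the following. The enlarged test string and the repeat count are arbitrary choices for illustration, and the numbers will vary by machine and Python version, so none are shown.

```python
import re
import timeit

example = 'X - abc_degree + 1 + qq + abc_degree + 1' * 1000
target = 'abc_degree + 1'
pattern = re.compile(re.escape(target))


def with_regex():
    return [m.span() for m in pattern.finditer(example)]


def with_find():
    spans, pos = [], 0
    while True:
        found = example.find(target, pos)
        if found < 0:
            return spans
        pos = found + len(target)
        spans.append((found, pos))


# Only compare speed once both approaches agree on the answer.
assert with_regex() == with_find()

for fn in (with_regex, with_find):
    print(fn.__name__, timeit.timeit(fn, number=100))
```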

 

 

From: Jen Kris <jenkris at tutanota.com> 
Sent: Tuesday, February 28, 2023 12:58 PM
To: avi.e.gross at gmail.com
Cc: 'Python List' <python-list at python.org>
Subject: RE: How to escape strings for re.finditer?

 

The code I sent is correct, and it runs here.  Maybe you received it with a carriage return removed, but on my copy after posting, it is correct:

 

import re

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
find_string = re.escape('abc_degree + 1')
for match in re.finditer(find_string, example):
    print(match.start(), match.end())

 

One question: several people have made suggestions other than regex (not the terser regex example you showed below). Is there a reason why regex is not preferred to, for example, a list comp? Performance? Reliability?


Feb 27, 2023, 18:16 by avi.e.gross at gmail.com:

Jen,

 

Can you see what SOME OF US see as ASCII text? We can help you better if we get code that can be copied and run as-is.

 

What you sent is not terse. It is wrong. It will not run on any python interpreter because you somehow lost a carriage return and indent.

 

This is what you sent:

 

example = 'X - abc_degree + 1 + qq + abc_degree + 1'

find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):

print(match.start(), match.end())

 

This is code indented properly:

 

example = 'X - abc_degree + 1 + qq + abc_degree + 1'

find_string = re.escape('abc_degree + 1') 

for match in re.finditer(find_string, example):

    print(match.start(), match.end())

 

Of course I am sure you wrote and ran code more like the latter version but somewhere in your copy/paste process, ....

 

And, just for fun, since there is nothing wrong with your code, this minor change is terser:

>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> for match in re.finditer(re.escape('abc_degree + 1'), example):
...     print(match.start(), match.end())
...

4 18

26 40

 

But note that once you use regular expressions more generally, though not in your case, you might match multiple things that are far from the same, such as any word repeated twice in a row in any case, including "and and" and "so so", or words that have multiple doubled letters, as in the stereotypical "bookkeeper". In those cases, you may want more than offsets: you may also want to show the exact text that matched, or even some characters before and/or after for context.
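
A sketch of what that might look like follows. The sample sentence, the eight-character context window, and the choice of "at least three doubled letters" are all assumptions made for the illustration.

```python
import re

text = 'He said so so often, and and the bookkeeper agreed.'

# The same word twice in a row, in any case; \1 refers back to the first group.
doubled_word = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)
for m in doubled_word.finditer(text):
    before = text[max(0, m.start() - 8):m.start()]
    after = text[m.end():m.end() + 8]
    print(m.span(), repr(m.group(0)), '| context:', repr(before), repr(after))

# Words containing at least three doubled letters, such as "bookkeeper".
three_doubles = re.compile(r'\b\w*(\w)\1\w*(\w)\2\w*(\w)\3\w*\b')
print([m.group(0) for m in three_doubles.finditer(text)])  # ['bookkeeper']
```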

 

 

-----Original Message-----

From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of Jen Kris via Python-list

Sent: Monday, February 27, 2023 8:36 PM

To: Cameron Simpson <cs at cskk.id.au>

Cc: Python List <python-list at python.org>

Subject: Re: How to escape strings for re.finditer?

 

 

I haven't tested it either but it looks like it would work. But for this case I prefer the relative simplicity of:

 

example = 'X - abc_degree + 1 + qq + abc_degree + 1'

find_string = re.escape('abc_degree + 1') for match in re.finditer(find_string, example):

print(match.start(), match.end())

 

4 18

26 40

 

I don't insist on terseness for its own sake, but it's cleaner this way. 

 

Jen

 

 

Feb 27, 2023, 16:55 by cs at cskk.id.au:

On 28Feb2023 01:13, Jen Kris <jenkris at tutanota.com> wrote:

I went to the re module because the specified string may appear more than once in the string (in the code I'm writing).

 

Sure, but writing a `finditer` for plain `str` is pretty easy (untested):

 

pos = 0
while True:
    found = s.find(substring, pos)
    if found < 0:
        break
    start = found
    end = found + len(substring)
    # ... do whatever with start and end ...
    pos = end

 

Many people go straight to the `re` module whenever they're looking for strings. It is often cryptic, error-prone overkill. Just something to keep in mind.

 

Cheers,

Cameron Simpson <cs at cskk.id.au>

--

https://mail.python.org/mailman/listinfo/python-list

 
