How to escape RE

avi.e.gross at gmail.com
Wed Mar 1 15:39:21 EST 2023


Cameron,

The topic is now Regular Expressions and the sin tax. This is not
exclusively a Python issue as everybody and even their grandmother uses it
in various forms.

I remember early versions of RE were fairly simple. It was a terse
mini-language that allowed fairly complex things to be done yet remained
readable.

You now encounter versions that make people struggle, as countless extensions
have been sloppily grafted on. As an example, who ordered the multiple new
uses of "?"? Many places have expanded on the terseness and made it both
more and less legible. Unicode made lots of older RE features not very
useful, as the definitions of things like what whitespace can be, and what a
word boundary or word contents might be, became so different that new
constructs were added to handle them.

But, if you are operating mainly on ASCII text, the base functionality is
still in there and can be used fairly easily.
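
As a rough illustration of the Unicode point, here is a minimal sketch using
Python's re module. The sample text is made up for the example. By default
\w and \d follow Unicode definitions for str patterns, and the re.ASCII flag
restores the older ASCII-only meaning.

```python
import re

# Sample text (an assumption for this sketch): accented letters plus an
# Arabic-Indic digit three (U+0663).
text = "na\u00efve caf\u00e9 42 \u0663"

# Default for str patterns: \w and \d use the Unicode definitions.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d+", text)

# With re.ASCII, \w means [a-zA-Z0-9_] and \d means [0-9] only.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

print(unicode_words, unicode_digits)
print(ascii_words, ascii_digits)
```

In ASCII mode the accented words are split at the non-ASCII letters, which
is the sort of surprise that led to the newer constructs.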

Consider it a bit like other mini-languages such as the print() variants
that kept adding functionality by packing lots of info tersely, so you
specify you want a floating point number with so many digits and so on, and
by the way, right justified in a wider field, and if it is negative, do this.
Great if you can still remember how to read it.
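
For anyone who has forgotten how to read it, a small sketch of that
formatting mini-language in Python follows. The values and field widths are
arbitrary choices for the example.

```python
value = -3.14159

# ">10.2f": right-justify in a 10-character field, 2 digits after the point.
right_justified = f"{value:>10.2f}"

# The older printf-style operator packs the same request differently.
percent_style = "%10.2f" % value

# "+" asks for an explicit sign on positive numbers too.
signed = f"{2.5:+8.1f}"

print(repr(right_justified))  # '     -3.14'
print(repr(percent_style))    # '     -3.14'
print(repr(signed))           # '    +2.5'
```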

I was reading a Python book recently which kept using a suffix of !r and I
finally looked it up. It is a conversion flag in a format string (an f-string
or str.format()) that asks for repr() of the object, which calls its
__repr__() method. Then I find out this is not strictly needed in f-strings,
as they let you write something like {repr(val)}, so {val!r} is not the only
and confusing way.
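
A quick sketch of the equivalence, with a made-up value: the !r conversion
and an explicit repr() call produce the same text.

```python
val = "hi\n"  # sample value; repr() will show the newline as an escape

with_flag = f"{val!r}"           # conversion flag in an f-string
with_call = f"{repr(val)}"       # explicit call inside the braces
with_format = "{!r}".format(val) # same flag with str.format()

print(with_flag)  # 'hi\n'
assert with_flag == with_call == with_format
```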

These mini-languages each require you to learn their own rules and quirks
and when you do, they can be powerful and intuitive, at least for the
features you memorized and maybe use regularly. 

Now RE knowledge is the same and it ports moderately well between languages
except when it doesn't. As has been noted, the people at Perl relied on it a
lot and kept changing and extending it. Some tools let you specify whether
you want Perl style or other styles (GNU grep's -P option, for example).

But hiding your head in the sand is not always going to work for long. No,
you do not need to use RE for simple cases. Mind you, that is when it is
easiest to use it reliably. I read some books related to XML where much of
the work had been done in non-UNIX land years ago and they often had other
ways of doing things in their endless series of methods on validating a
schema or declaring it so data is forced to match the declared objectives
such as what type(s) each item can be or whether some fields must exist
inside others or in a particular order, or say you can have only three of
them and seemingly endless other such things. And then, suddenly, someone has
the idea to introduce the ability for you to specify many things using
regular expressions and the oppressiveness (for me) lifts and many things
can now be done trivially or that were not doable before. I had a similar
experience in my SQL reading where adding the ability to do some pattern
matching using a form of RE made life simpler.

The fact is that the idea of complex pattern matching IS complex and any
tool that lets you express it so fluidly will itself be complex. So, as some
have mentioned, find a resource that helps you build a regular expression
perhaps through menus, or one that verifies if one you created makes any
sense or lets you enter test data and have it show you how it is matching or
what to change to make it match differently. The multi-line version of RE
may also be helpful as well as sometimes breaking up a bigger one into
several smaller ones that your program uses in multiple phases.
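
Here is a minimal sketch of both ideas in Python, assuming a made-up
"key = value" line format. The names LINE_RE, NUMBER_RE and parse_line are
hypothetical. re.VERBOSE allows whitespace and comments inside the pattern,
and the numeric check is done in a second, smaller pattern.

```python
import re

# Phase 1: split a line into key and value (format invented for this sketch).
LINE_RE = re.compile(
    r"""
    ^\s*
    (?P<key>[A-Za-z_]\w*)   # key: an identifier-like name
    \s*=\s*                 # equals sign with optional spaces around it
    (?P<value>.*?)          # value: anything, as short as possible
    \s*$                    # ignore trailing whitespace
    """,
    re.VERBOSE,
)

# Phase 2: a separate, simpler pattern that only recognizes plain numbers.
NUMBER_RE = re.compile(r"[+-]?\d+(?:\.\d+)?")


def parse_line(line):
    """Return (key, value) or None; numeric values are converted to float."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    value = m.group("value")
    if NUMBER_RE.fullmatch(value):
        value = float(value)
    return m.group("key"), value


print(parse_line("  width = 42  "))
print(parse_line("name = Avi Gross"))
print(parse_line("= oops"))
```

Each pattern stays small enough to test on its own, which is most of the
benefit of working in phases.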

Python 3.10 added new functionality called Structural Pattern Matching.
You use a match statement with various cases that match patterns and if
matched, execute some action. Here is one tutorial if needed:

https://peps.python.org/pep-0636/

The point is that although not at all the same as a RE, we again have a bit
of a mini-language that can be used fairly concisely to investigate a
problem domain fairly quickly and efficiently and do things. It is an
overlapping but different form of pattern matching. And, in languages that
have long had similar ideas and constructs, people often cut back on using
other constructs like an IF statement, and just used something like this!

And consider this example as being vaguely like a bit of regular expression:

match command.split():
    case ["go", ("north" | "south" | "east" | "west")]:
        current_room = current_room.neighbor(...)

Like it or not, our future in programming is likely to include more and more
such aids along with headaches.

Avi

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On
Behalf Of Grant Edwards
Sent: Wednesday, March 1, 2023 12:04 PM
To: python-list at python.org
Subject: Re: How to escape strings for re.finditer?

On 2023-02-28, Cameron Simpson <cs at cskk.id.au> wrote:

> Regexps are:
> - cryptic and error prone (you can make them more readable, but the 
>    notation is deliberately both terse and powerful, which means that 
>    small changes can have large effects in behaviour); the "error prone" 
>    part does not mean that a regexp is unreliable, but that writing one 
>    which is _correct_ for your task can be difficult,

The nasty thing is that writing one that _appears_ to be correct for your
task is often fairly easy. It will work as you expect for the test cases you
throw at it, but then fail in confusing ways when released into the "real
world". If you're lucky, it fails frequently and obviously enough that you
notice it right away. If you're not lucky, it will fail infrequently and
subtly for many years to come.
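
A small sketch of that failure mode, using an invented pattern meant to
match decimal numbers such as 3.14. The unescaped "." passes the obvious
test and then quietly accepts input it should reject.

```python
import re

# Intended: digits, a literal dot, digits. The "." is not escaped.
looks_right = re.compile(r"\d+.\d+")

# Corrected version with the dot escaped.
fixed = re.compile(r"\d+\.\d+")

obvious_case = bool(looks_right.fullmatch("3.14"))    # passes the easy test
surprise_1 = bool(looks_right.fullmatch("3x14"))      # also matches
surprise_2 = bool(looks_right.fullmatch("12345"))     # also matches

fixed_results = [bool(fixed.fullmatch(s)) for s in ("3.14", "3x14", "12345")]

# re.escape() does the escaping when the text comes from a variable.
escaped = re.escape("3.14")

print(obvious_case, surprise_1, surprise_2)
print(fixed_results, escaped)
```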

My rule: never use an RE if you can use the normal string methods (even if
it takes a few lines of code using them to replace a single RE).
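
For the question in the subject line, that rule might look like the sketch
below: finding every occurrence of a literal string with str.find() instead
of re.finditer(). The helper name find_all is hypothetical, and the regex
line is included only to show that the results agree once the needle has
been passed through re.escape().

```python
import re


def find_all(text, needle):
    """Return the start index of each non-overlapping occurrence of needle."""
    if not needle:
        raise ValueError("needle must not be empty")
    positions = []
    start = 0
    while True:
        i = text.find(needle, start)
        if i == -1:
            return positions
        positions.append(i)
        start = i + len(needle)  # continue after this occurrence


text = "a.b.c axb a.b"
plain = find_all(text, "a.b")

# Same search with an RE; the needle must be escaped first.
with_re = [m.start() for m in re.finditer(re.escape("a.b"), text)]

# Without escaping, "." matches any character, so "axb" is found as well.
unescaped = [m.start() for m in re.finditer("a.b", text)]

print(plain, with_re, unescaped)
```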

--
Grant
--
https://mail.python.org/mailman/listinfo/python-list