Code improvement question

avi.e.gross at gmail.com
Sat Nov 18 01:55:13 EST 2023


Many features, like regular expressions, are mini-languages designed to be very powerful while also being a tad cryptic to anyone unfamiliar with them.

But consider the alternative in some languages: a complex set of nested function calls with names like match_white_space(2, 5). Even if some are set up to be fairly readable, they can be a pain. Quite a few problems can be solved nicely with a single regular expression, or with several in a row, each one fairly simple. Sometimes you can do part of the work with the usual text manipulation functions, built in or from a module, either for speed or so that the RE part is simpler and easier to follow.
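
As a rough sketch of that last point, using the pattern quoted further down in this thread: plain string methods can do the tokenizing, so the RE no longer needs the \b anchors. The helper name and the sample text are made up for illustration, and this is not an exact equivalent of the \b version (for instance, "a-12-34-56" is handled differently).

```python
import re
import string

# Simpler pattern: no \b anchors, because each token is tested as a whole.
SIMPLE = re.compile(r"[0-9]{2,7}-[0-9]{2}-[0-9]{2}")

def extract_codes(text):
    """Split on whitespace, trim surrounding punctuation, test each token."""
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if SIMPLE.fullmatch(tok)]

# Invented sample text, only for illustration.
found = extract_codes("Items 12-34-56, (1234567-89-01) and 1-23-45.")
print(found)
```

Whether this reads better than one slightly longer RE is a judgment call; it mostly helps when the tokenizing rules are easier to say with split() and strip() than with anchors.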

And, as noted, Python lets you include comments in an RE and use extensions such as the Perl-style ones. Adding enough comments above or within the code can remind people what is going on or point to a reference, and just explaining in English (or whatever language you hope later readers will understand) can help. You can spell out, in whatever level of detail you like, what you expect your data to look like and what you want to match or extract; then the RE may be easier to follow.

Of course, the endless extensions added to support things like Unicode have made some REs much harder to create or understand, and sometimes the result may not even be what you expected if something strange shows up, like the symbols ①❹⓸.

Depending on the tool, the above might be treated as digits and at some point be interpreted as 144, that is, 12 dozen, which may even be appropriate but is perhaps a bit of a surprise.
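
For what it is worth, here is a small check of how Python 3 itself treats such characters. As far as I know, \d on a str pattern matches Unicode decimal digits (category Nd), which the circled symbols are not, although str.isdigit() does accept them. The Arabic-Indic string is my own addition for contrast, and the variable names are just for illustration.

```python
import re
import unicodedata

circled = "①❹⓸"                      # the symbols from the message above
arabic_indic = "\u0661\u0664\u0664"   # ARABIC-INDIC DIGITS ONE, FOUR, FOUR

# \d on a str pattern matches any Unicode decimal digit (category Nd) ...
nd_matches = re.findall(r"\d", arabic_indic)
# ... while an explicit [0-9] class, or the re.ASCII flag, stays with ASCII.
class_matches = re.findall(r"[0-9]", arabic_indic)
ascii_matches = re.findall(r"\d", arabic_indic, re.ASCII)
# int() accepts the same decimal digits that \d matches.
as_number = int(arabic_indic)

# The circled digits are category No, so \d skips them, yet str.isdigit()
# is true and unicodedata.digit() reports their digit values.
circled_matches = re.findall(r"\d", circled)
circled_values = [unicodedata.digit(ch) for ch in circled]

print(len(nd_matches), class_matches, ascii_matches, as_number)
print(circled_matches, circled.isdigit(), circled.isdecimal(), circled_values)
```

So in Python the surprise is more likely to come from \d accepting digits from other scripts than from the circled forms, and writing [0-9] explicitly, as the pattern in this thread does, sidesteps it.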

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of Peter J. Holzer via Python-list
Sent: Friday, November 17, 2023 6:18 AM
To: python-list at python.org
Subject: Re: Code improvement question

On 2023-11-16 11:34:16 +1300, Rimu Atkinson via Python-list wrote:
> > > Why don't you use re.findall?
> > > 
> > > re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
> > 
> > I think I can see what you did there but it won't make sense to me - or
> > whoever looks at the code - in future.
> > 
> > That answers your specific question. However, I am in awe of people who
> > can just "do" regular expressions and I thank you very much for what
> > would have been a monumental effort had I tried it.
> 
> I feel the same way about regex. If I can find a way to write something
> without regex I very much prefer to as regex usually adds complexity and
> hurts readability.

I find "straight" regexps very easy to write. There are only a handful
of constructs which are all very simple and you just string them
together. But then I've used regexps for 30+ years, so of course they
feel natural to me.

(Reading regexps may be a bit harder, exactly because they are too
simple: There is no abstraction, so a complicated pattern results in a
long regexp.)

There are some extensions to regexps which are conceptually harder, like
lookahead and lookbehind or nested contexts in Perl. I may need the
manual for those (especially because they are new(ish) and every
language uses a different syntax for them) or avoid them altogether.

Oh, and Python (just like Perl) allows you to embed whitespace and
comments into Regexps, which helps readability a lot if you have to
write long regexps.
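
A minimal sketch of what that can look like with the pattern from this thread, using the re.VERBOSE flag. The name NUMBER_RE and the sample text are invented for illustration.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and everything
# from "#" to the end of a line is a comment.
NUMBER_RE = re.compile(
    r"""
    \b            # word boundary
    [0-9]{2,7}    # 2 to 7 digits
    -             # hyphen-minus
    [0-9]{2}      # exactly 2 digits
    -             # hyphen-minus
    [0-9]{2}      # exactly 2 digits
    \b            # word boundary
    """,
    re.VERBOSE,
)

# Invented sample text, only for illustration.
txt = "refs 12-34-56 and 1234567-89-01, but not 1-23-45"
verbose_hits = NUMBER_RE.findall(txt)
print(verbose_hits)
```

One thing to watch in verbose mode: a literal space or "#" in the pattern has to be escaped or put inside a character class, otherwise it is silently dropped.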


> You might find https://regex101.com/ to be useful for testing your regex.
> You can enter in sample data and see if it matches.
> 
> If I understood what your regex was trying to do I might be able to suggest
> some python to do the same thing. Is it just removing numbers from text?

Not "removing" them (as I understood it), but extracting them (i.e. find
and collect them).

> > > re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

\b         - a word boundary.
[0-9]{2,7} - 2 to 7 digits
-          - a hyphen-minus
[0-9]{2}   - exactly 2 digits
-          - a hyphen-minus
[0-9]{2}   - exactly 2 digits
\b         - a word boundary.
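
To see those pieces at work, here is a quick run on made-up input (the sample string is mine, not from the original question):

```python
import re

# The pattern exactly as quoted in the thread.
PATTERN = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b')

# Invented sample text: two strings that fit, one whose first group is
# too short (1 digit) and one whose first group is too long (8 digits).
sample = ("Order 12-34-56 shipped, 1234567-89-01 pending, "
          "1-23-45 too short, 12345678-90-12 too long.")

matches = PATTERN.findall(sample)
print(matches)  # ['12-34-56', '1234567-89-01']
```

The 8-digit case is rejected because \b cannot match between two digits, so the match cannot simply start one digit later.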

Seems quite straightforward to me. I'll be impressed if you can write
that in Python in a way which is easier to read.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


