PyWart: Python regular expression syntax is not intuitive.

Rick Johnson rantingrickjohnson at gmail.com
Wed Jan 25 12:16:01 EST 2012


In particular I find the "extension notation" syntax to be woefully
inadequate. You should be able to infer the action of an extension
intuitively, simply from looking at its signature. I find myself
continually needing to consult the docs because the current syntax
is either uninformative or misleading. Consider:

(...) # Group Capture
Okay here. Parentheses feel very natural for delimiting a group.

(?...)  # Base Extension Syntax
All extensions are wrapped in parentheses and start with a question
mark, but I believe the question mark was a very bad choice, since the
question mark already means "zero or one repetitions of the preceding
RE". This simple error is why I believe Python REs are so damn
difficult to eyeball parse. You are constantly forced to spend too
much time deciding whether a given question mark refers to repeats or
starts an extension. We should have chosen another char, and that
char should NOT be known to RE in any other place. Maybe the tilde
would work? Wait, I have a MUCH better idea!!!

Actually the best choice would have been using BRACES instead of
PARENTHESES to delimit the extension syntax, since parentheses are
used (wisely I might add!) for group captures.  Also, anything
contained in braces is more likely to be understood (by almost all
programmers) as a "command block" -- unfortunately some idiot decided
to use braces for specifying repetition ranges! WHAT A F'ING WASTE of
intuitive chars!
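
To make the complaint concrete, here is a minimal sketch (standard re
module only; the sample strings are made up for illustration) of the
three jobs the question mark currently does, plus the existing use of
braces for repetition ranges:

```python
import re

# "?" as a quantifier: zero or one of the preceding item.
optional_u = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour")]

# "?" after another quantifier: makes that quantifier non-greedy.
greedy = re.match(r"<.+>", "<a><b>").group()
lazy = re.match(r"<.+?>", "<a><b>").group()

# "?" right after "(": start of an extension, here a non-capturing group.
# The second "?" on the same line is the plain quantifier again.
ext = re.match(r"(?:ab)?c", "abc").group()

# Braces are already taken by the repetition-range syntax.
in_range = re.fullmatch(r"a{2,3}", "aaa") is not None

print(optional_u, greedy, lazy, ext, in_range)
```

All three meanings of "?" can show up within a few chars of each
other, which is the eyeball-parsing problem described above.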

(?iLmsux) # Passing Flags Internally
This is ridiculous. REs are cryptic enough without inviting TIMTOWTDI
over to play. Passing flags this way does nothing BUT harm
readability. Please people, pass your flags as an argument to the
appropriate re.method() and NOT as another cryptic syntax.
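
For comparison, a small sketch of both ways of passing flags (the
sample text is an assumption made up for this example). Both calls
should find the same match; only the readability of the pattern
differs:

```python
import re

text = "Hello\nWORLD"

# Inline flags: the "(?im)" prefix switches on IGNORECASE and MULTILINE.
inline = re.findall(r"(?im)^world$", text)

# The same flags passed as an argument, which keeps the pattern clean.
by_arg = re.findall(r"^world$", text, re.IGNORECASE | re.MULTILINE)

print(inline, by_arg)
```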

(?:...) # Non-Capturing Group
When I look at this pattern, "non-capturing" DOES NOT scream out at me,
and again, the question mark is used incorrectly. When I think of a
char that screams NEGATIVE, I think of the exclamation mark, NOT the
question mark. And how the HELL is the colon helping me to interpret
this syntax?
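
For reference, this sketch shows what the "?:" actually changes (the
date string is just an illustrative input). The group still matches,
but it is left out of the captured results:

```python
import re

date = "2012-01-25"

# Three ordinary groups: all three are captured.
capturing = re.match(r"(\d{4})-(\d{2})-(\d{2})", date)

# The middle group is non-capturing: it must match but is not stored.
non_capturing = re.match(r"(\d{4})-(?:\d{2})-(\d{2})", date)

cap_groups = capturing.groups()
noncap_groups = non_capturing.groups()

print(cap_groups, noncap_groups)
```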

(?P<name>...) # Named Group Capture
(?P=name) # Named Group Reference
(?#...)  # Comment
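
A short sketch of the three forms above working together (the
sentence being searched is made up for the example). It looks for a
word that is immediately repeated:

```python
import re

# (?P<word>...) names the group; (?P=word) must match the same text again.
# The (?#...) group is a comment and matches nothing.
pattern = re.compile(r"(?P<word>\b\w+)(?# the repeated word) (?P=word)\b")

m = pattern.search("it was the the best of times")
word = m.group("word")
matched = m.group(0)

print(word, repr(matched))
```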

################################################
## The following assertions are highly flawed ##
################################################

(?=...)  # positive look ahead
(?!...)  # negative look ahead
(?<=...) # positive look behind
(?<!...) # negative look behind
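
For reference, a minimal sketch of what each of the four assertions
does in the current syntax (the file names are assumptions chosen for
illustration):

```python
import re

files = "foo.py bar.txt baz.py"

# (?=...)  positive look ahead: names that are followed by ".py"
py_names = re.findall(r"\w+(?=\.py)", files)

# (?!...)  negative look ahead: files whose extension is not "py"
not_py = re.findall(r"\b\w+\.(?!py)\w+", files)

# (?<=...) positive look behind: words that are preceded by a dot
extensions = re.findall(r"(?<=\.)\w+", files)

# (?<!...) negative look behind: words not preceded by a dot or word char
base_names = re.findall(r"(?<![.\w])\w+", files)

print(py_names, not_py, extensions, base_names)
```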

I cannot decipher these patterns in their current syntactical forms.
Too much information is missing or misleading. I have no idea which
pattern is looking forward, which pattern is looking backward, which
pattern is negative, and which pattern is positive. I need syntactical
clues! Consider these:

(?>=...) #Read as "forward equals pattern?"
(?>!=...) #Read as "forward NOT equals pattern?"
(?<=...) #Read as "backwards equals pattern?"
(?<!=...) #Read as "backwards NOT equals pattern?"

However, I really don't like the fact that negative assertions need
one more char than positive assertions. Here is an alternative:

(?>+...) #Read as "forward equals pattern?"
(?>-...) #Read as "forward NOT equals pattern?"
(?<+...) #Read as "backwards equals pattern?"
(?<-...) #Read as "backwards NOT equals pattern?"

Looks much better, HOWEVER we still have too much useless noise.
Replace the parentheses delimiters with braces, and drop the "where's
Waldo" question mark, and we have simple, intuitive syntactical
bliss!

{...}  # Base Extension Syntax
{iLmsux}  # Passing Flags Internally
{!()...} or (!...) # Non Capturing.
{NG=identifier...}  # Named Group Capture
{NG.name}  # Named Group Reference
{#...}  # Comment
{>+...}  # Positive Look Ahead Assertion
{>-...}  # Negative Look Ahead Assertion
{<+...}  # Positive Look Behind Assertion
{<-...}  # Negative Look Behind Assertion
{(id/name)yes-pat|no-pat}  # Conditional Match
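
To try the brace forms out, here is a rough, hypothetical sketch of a
translator from part of the proposed syntax to what re accepts today.
Nothing like this exists in the re module; the OPENERS table and the
translate() helper are names invented for this example. It covers
only the four assertions and the comment form, and it assumes braces
are never used for repetition ranges, never escaped, and never appear
inside a character class:

```python
import re

# Hypothetical mapping from the proposed brace openers to current syntax.
OPENERS = {
    "{>+": "(?=",   # positive look ahead
    "{>-": "(?!",   # negative look ahead
    "{<+": "(?<=",  # positive look behind
    "{<-": "(?<!",  # negative look behind
    "{#": "(?#",    # comment
}


def translate(proposed):
    """Rewrite the proposed brace extensions into current re syntax.

    Sketch only: every "}" is treated as the end of an extension, so
    range quantifiers such as a{2,3} are NOT supported here.
    """
    out = []
    i = 0
    while i < len(proposed):
        for opener, current in OPENERS.items():
            if proposed.startswith(opener, i):
                out.append(current)
                i += len(opener)
                break
        else:
            # No opener at this position: copy the char, closing braces
            # become closing parentheses.
            out.append(")" if proposed[i] == "}" else proposed[i])
            i += 1
    return "".join(out)


translated = translate(r"\w+{>+\.py}")
found = re.findall(translate(r"{<+\.}\w+"), "foo.py bar.txt")
print(translated, found)
```

The need to rule out range quantifiers shows the price of the brace
idea: the existing {m,n} syntax would have to move somewhere else.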

*school-bell-rings*

PS: In my eyes, Python 3000 is already a dinosaur.


