[Python-ideas] What about regexp string litterals : re".*" ?
Stephan Houben
stephanh42 at gmail.com
Fri Mar 31 04:20:48 EDT 2017
Hi all,
FWIW, I also strongly prefer the Verbal Expression style and find that
"normal" regular expressions quickly become unreadable and
unmaintainable.
Verbal Expressions are also much more composable.
Stephan
2017-03-31 9:23 GMT+02:00 Stephen J. Turnbull
<turnbull.stephen.fw at u.tsukuba.ac.jp>:
> Abe Dillon writes:
>
> > Note that the entire documentation is 250 words while just the syntax
> > portion of Python docs for the re module is over 3000 words.
>
> Since Verbal Expressions (below, VEs, indicating notation) "compile"
> to regular expressions (spelling out indicates the internal matching
> implementation), the documentation of VEs presumably ignores
> everything except the limited language it's useful for. To actually
> understand VEs, you need to refer to the RE docs. Not a win IMO.
>
> > > You think that example is more readable than the proposed translation
> > > ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$
> > > which is better written
> > > ^https?://(www\.)?[^ ]*$
> > > or even
> > > ^https?://[^ ]*$
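[For reference, the three quoted patterns all accept a typical URL; a quick check with the stdlib re module:]

```python
import re

# The three quoted patterns, applied to a sample URL. All three accept it;
# the shorter forms simply drop groups and escapes the first one carries.
patterns = [
    r"^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$",
    r"^https?://(www\.)?[^ ]*$",
    r"^https?://[^ ]*$",
]

for pat in patterns:
    print(bool(re.match(pat, "https://www.example.com/path")))  # True each time
```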
> >
> >
> > Yes. I find it *far* more readable. It's not a soup of symbols like Perl
> > code. I can only surmise that you're fluent in regex because it seems
> > difficult for you to see how the above could be less readable than English
> > words.
>
> Yes, I'm fairly fluent in regular expression notation (below, REs).
> I've maintained a compiler for one dialect.
>
> I'm not interested in the difference between words and punctuation
> though. The reason I find the middle RE most readable is that it
> "looks like" what it's supposed to match, in a contiguous string as
> the object it will match will be contiguous. If I need to parse it to
> figure out *exactly* what it matches, yes, that takes more effort.
> But to understand a VE's semantics correctly, I'd have to look it up
> as often as you have to look up REs because many words chosen to notate
> VEs have English meanings that are (a) ambiguous, as in all natural
> language, and (b) only approximate matches to RE semantics.
>
> > I could tell it only matches URLs that are the only thing inside
> > the string because it clearly says: start_of_line() and
> > end_of_line().
>
> That's not the problem. The problem is the semantics of the method
> "find". "then" would indeed read better, although it doesn't exactly
> match the semantics of concatenation in REs.
>
> > I would have had to refer to a reference to know that "^" doesn't
> > always mean "not", it sometimes means "start of string" and
> > probably other things. I would also have to check a reference to
> > know that "$" can mean "end of string" (and probably other things).
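[For reference, both meanings of "^" (and the "$" anchor) are easy to demonstrate:]

```python
import re

# "^" at the start of a pattern anchors the match to the beginning of the
# string; inside a character class it negates the class.
print(re.findall(r"^ab", "abcab"))    # ['ab'] -- anchor: only the leading "ab"
print(re.findall(r"[^ab]", "abcab"))  # ['c']  -- negated class: anything but a/b
# "$" anchors to the end of the string (or just before a trailing newline).
print(bool(re.search(r"ab$", "abcab")))  # True
```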
>
> And you'll still have to do that when reading other people's REs.
>
> > > Are those groups capturing in Verbal Expressions? The use of
> > > "find" (~ "search") rather than "match" is disconcerting to the
> > > experienced user.
> >
> > You can alternately use the word "then". The source code is just
> > one python file. It's very easy to read. I actually like "then"
> > over "find" for the example:
>
> You're missing the point. The reader does not get to choose the
> notation, the author does. I do understand what several varieties of
> RE mean, but the variations are of two kinds: basic versus extended
> (ie, what tokens need to be escaped to be taken literally, which ones
> have special meaning if escaped), and extensions (which can be
> ignored). Modern RE facilities are essentially all of the extended
> variety. Once you've learned that, you're in good shape for almost
> any RE that should be written outside of an obfuscated code contest.
>
> This is a fundamental principle of Python design: don't make readers
> of code learn new things. That includes using notation developed
> elsewhere in many cases.
>
> > What does alternation look like?
> >
> > .OR(option1).OR(option2).OR(option3)...
> >
> > How about alternation of
> > > non-trivial regular expressions?
> >
> > .OR(other_verbal_expression)
>
> Real examples, rather than pseudo code, would be nice. I think you,
> too, will find that examples of even fairly simple nested alternations
> containing other constructs become quite hard to read, as they fall
> off the bottom of the screen.
>
> For example, the VE equivalent of
>
> scheme = "(https?|ftp|file):"
>
> would be (AFAICT):
>
> scheme = VerEx().then(VerEx().then("http")
>                              .maybe("s")
>                              .OR("ftp")
>                              .OR("file"))
>                 .then(":")
>
> which is pretty hideous, I think. And the colon is captured by a
> group. If perversely I wanted to extract that group from a match,
> what would its index be?
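[For comparison, group numbering in the plain-RE equivalent is well defined: groups are numbered by the position of their opening parenthesis, left to right:]

```python
import re

# A sketch of how the groups in the RE equivalent would be numbered.
m = re.match(r"((https?|ftp|file))(:)", "https:")
print(m.group(0))  # "https:" -- the whole match
print(m.group(1))  # "https"  -- outermost group
print(m.group(3))  # ":"      -- the colon's group index
```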
>
> I guess you could keep the linear arrangement with
>
> scheme = (VerEx().add("(")
>                  .then("http")
>                  .maybe("s")
>                  .OR("ftp")
>                  .OR("file")
>                  .add(")")
>                  .then(":"))
>
> but is that really an improvement over
>
> scheme = VerEx().add("(https?|ftp|file):")
>
> ;-)
>
> > > As far as I can see, Verbal Expressions are basically a way of
> > > making it so painful to write regular expressions that people
> > > will restrict themselves to regular expressions
> >
> > What's so painful to write about them?
>
> One thing that's painful is that VEs "look like" context-free
> grammars, but clumsy and without the powerful semantics. You can get
> the readability you want with greater power using grammars, which is
> why I would prefer we work on getting a parser module into the stdlib.
>
> But if one doesn't know about grammars, it's still not great. The
> main pains about writing VEs for me are (1) reading what I just wrote,
> (2) accessing capturing groups, and (3) verbosity. Even a VE to
> accurately match what is normally a fairly short string, such as the
> scheme, credentials, authority, and port portions of a "standard" URL,
> is going to be hundreds of characters long and likely dozens of lines
> if folded as in the examples.
>
> Another issue is that we already have a perfectly good poor man's
> matching library: glob. The URL example becomes
>
> http{,s}://{,www.}*
>
> Granted you lose the anchors, but how often does that matter? You
> apparently don't use them often enough to remember them.
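[A caveat on the glob example: Python's own fnmatch machinery has no {,s} brace alternation, so the closest stdlib equivalent replaces the braces with "*":]

```python
import fnmatch

# Shell-style matching via the stdlib; note fnmatch supports *, ?, and
# [seq], but not the shell's {a,b} brace alternation used above.
pattern = "http*://*"
print(fnmatch.fnmatch("https://www.example.com/page", pattern))  # True
print(fnmatch.fnmatch("ftp://example.com", pattern))             # False
```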
>
> > Does your IDE not have autocompletion?
>
> I don't want an IDE. I have Emacs.
>
> > I find REs so painful to write that I usually just use string
> > methods if at all feasible.
>
> Guess what? That's the right thing to do anyway. They're a lot more
> readable and efficient when partitioning a string into two or three
> parts, or recognizing a short list of affixes. But chaining many
> methods, as VEs do, is not a very Pythonic way to write a program.
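[For instance, the two- or three-way splits mentioned here are one-liners with string methods:]

```python
# Partitioning a string without any regular expression at all.
url = "https://www.example.com/path"
scheme, sep, rest = url.partition("://")
print(scheme)                   # "https"
print(rest.startswith("www."))  # True
```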
>
> > > I don't think that this failure to respect the developer's taste
> > > is restricted to this particular implementation, either.
> >
> > I generally find it distasteful to write a pseudolanguage in
> > strings inside of other languages (this applies to SQL as well).
>
> You mean like arithmetic operators? (Lisp does this right, right?
> Only one kind of expression, the function call!) It's a matter of
> what you're used to. I understand that people new to text-processing,
> or who don't do so much of it, don't find REs easy to read. So how is
> this a huge loss? They don't use regular expressions very often! In
> fact, they're far more likely to encounter, and possibly need to
> understand, REs written by others!
>
> > Especially when the design principles of that pseudolanguage are
> > *diametrically opposed* to the design principles of the host
> > language. A key principle of Python's design is: "you read code a
> > lot more often than you write code, so emphasize
> > readability". Regex seems to be based on: "Do the most with the
> > fewest key-strokes.
>
> So is all of mathematics. There's nothing wrong with concise
> expression for use in special cases.
>
> > Readability be damned!". It makes a lot more sense to wrap the
> > pseudolanguage in constructs that bring it in line with the host
> > language than to take on the mental burden of trying to comprehend
> > two different languages at the same time.
> >
> > If you disagree, nothing's stopping you from continuing to write
> > REs the old-fashioned way.
>
> I don't think that RE and SQL are "pseudo" languages, no. And I, and
> most developers, will continue to write regular expressions using the
> much more compact and expressive RE notation. (In fact with the
> exception of the "word" method, in VEs you still need to use RE notation
> to express most of the Python extensions.) So what you're saying is
> that you don't read much code, except maybe your own. Isn't that your
> problem? Those of us who cooperate widely on applications using
> regular expressions will continue to communicate using REs. If that
> leaves you out, that's not good. But adding VEs to the stdlib (and
> thus encouraging their use) will split the community into RE users and
> VE users, if VEs are at all useful. That's bad. I don't see the
> potential usefulness of VEs to infrequent users of regular
> expressions outweighing the downsides of "many ways to do it" in the
> stdlib.
>
> > Can we at least agree that baking special re syntax directly into
> > the language is a bad idea?
>
> I agree that there's no particular need for RE literals. If one wants
> to mark an RE as some special kind of object, re.compile() does that
> very well both by converting to a different type internally and as a
> marker syntactically.
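[For example, re.compile already gives both the distinct type and the visible marker:]

```python
import re

# re.compile both marks the pattern syntactically and converts it to a
# distinct Pattern type, so no dedicated literal syntax is needed.
url_re = re.compile(r"^https?://[^ ]*$")
print(type(url_re).__name__)            # "Pattern"
print(bool(url_re.match("https://x")))  # True
```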
>
> > On Wed, Mar 29, 2017 at 11:49 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> >
> > > We don't really want to ease the use of regexps in Python - while
> > > they're an incredibly useful tool in a programmer's toolkit,
> > > they're so cryptic that they're almost inevitably a
> > > maintainability nightmare.
>
> I agree with Nick. Regular expressions, whatever the notation, are a
> useful tool (no suspension of disbelief necessary for me, though!).
> But they are cryptic, and it's not just the notation. People (even
> experienced RE users) are often surprised by what fairly simple
> regular expressions match in a given text, because people want to read
> a regexp as instructions to a one-pass greedy parser, which it isn't.
>
> For example, above I wrote
>
> scheme = "(https?|ftp|file):"
>
> rather than
>
> scheme = "(\w+):"
>
> because it's not unlikely that I would want to treat those differently
> from other schemes such as mailto, news, and doi. In many
> applications of regular expressions (such as tokenization for a
> parser) you need many expressions. Compactness really is a virtue in
> REs.
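[A quick illustration of the difference between the two scheme patterns; the names "narrow" and "broad" are just for the example:]

```python
import re

# "(\w+):" accepts any scheme at all, while the explicit alternation
# lets mailto, news, doi, etc. be routed to different handling.
narrow = re.compile(r"(https?|ftp|file):")
broad = re.compile(r"(\w+):")
for s in ["https:", "mailto:", "doi:"]:
    print(s, bool(narrow.match(s)), bool(broad.match(s)))
```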
>
> Steve
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/