Regular expressions

rurpy at yahoo.com rurpy at yahoo.com
Tue Nov 3 19:22:53 EST 2015


On 11/03/2015 12:15 AM, Steven D'Aprano wrote:
> On Tue, 3 Nov 2015 03:23 pm, rurpy wrote:
> 
>> Regular expressions should be learned by every programmer or by anyone
>> who wants to use computers as a tool.  They are a fundamental part of
>> computer science and are used in all sorts of matching and searching
>> from compilers down to your work-a-day text editor.
> 
> You are absolutely right.
> 
> If only regular expressions weren't such an overly-terse, cryptic
> mini-language, with all but no debugging capabilities, they would be great.
> 
> If only there wasn't an extensive culture of regular expression abuse within
> programming communities, they would be fine.
> 
> All technologies are open to abuse. But we don't say:
> 
>   Some people, when confronted with a problem, think "I know, I'll use
>   arithmetic." Now they have two problems.
> 
> because abuse of arithmetic is rare. It's hard to misuse it, and while
> arithmetic can be complicated, it's rare for programmers to abuse it. But
> the same cannot be said for regexes -- they are regularly misused, abused,
> and down-right hard to use right even when you have a good reason for using
> them:
> 
> http://www.thedailywtf.com/articles/Irregular_Expression
> 
> http://blog.codinghorror.com/regex-use-vs-regex-abuse/
> 
> http://psung.blogspot.com.au/2008/01/wonderful-abuse-of-regular-expressions.html

Thanks for pointing out three cases of misuse of regexes out of the
approximately 375000000 [*] uses of regexes in the wild. I hope you're
not dumb enough to think that constitutes significant evidence.

Even worse, of the three only one was a real example. One of the others
was machine-generated code, the other was a "look what you can do with
regexes" example, not serious code.

Here is an example of "abusing" python

  https://benkurtovic.com/2014/06/01/obfuscating-hello-world.html

I wouldn't use this as evidence that Python is to be avoided.

> If there is one person who has done more to create a regex culture, it is
> Larry Wall, inventor of Perl. Even Larry Wall says that regexes are
> overused and their syntax is harmful, and he has recreated them for Perl 6:
> 
> http://www.perl.com/pub/2002/06/04/apo5.html

You really should have read beyond the first paragraph. He proposes
fixing regexes by adding even more special character combinations and
making regexes even *more* powerful. (He turned them into full-blown
parsers.)

Nowhere does he advocate not using, or avoiding if possible, regexes
as is the mantra in this list.

Here is Larry's "recreation" that you are touting:

  http://design.perl6.org/S05.html

Please explain to us how you think this "fix" addresses the complaints
you and other Python anti-regexers have about regexes.

I hope you also noted Larry's tongue-in-cheek writing style. Right after
pointing out that some claim Perl is hard to read due largely to regex
syntax, he writes:

  "Funny that other languages have been borrowing Perl's regular
  expressions as fast as they can..."

So I don't think you can claim Larry Wall as a supporter of this list's
anti-regex attitude beyond some superficial verbiage taken out of context.

> Oh, and the icing on the cake, regexes can be a security vulnerability too:
> https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS

And here is a list of CVEs involving Python. There are (at time of
writing) 190 of them.

  http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=python

So if a security vulnerability is reason not to use regexes, we should
all be *running* from Python. I sure you'll point out that most have
been fixed.

But you failed to point out that same is true of regex engines. From
your source:

  "Notice, that not all algorithms are naïve, and actually Regex
  algorithms can be written in an efficient way."

And in fact, again, had you looked beyond a headline that suited your
purpose, you could have tried the "Evil Regexes" noted in that source
and discovered none of them are a DoS in Python.

Even were that not true, normal practice applies: if the input is
untrusted then sanitize it, or mitigate the threat by imposing a timeout,
etc. Not exactly a problem or solution unique to regexes. And common
sense should tell you that since there are a lot of "try a regex" web
sites, this is not a problem without a solution.

And *certainly* not a reason not to use them in the *far* more common
case when they *are* trusted because you are in control of them,

Finally, preemptively, I'll repeat I acknowledge regexs are not the
the optimum solution in every case where they could be used. But they
are very useful when one passes the border of the trivial; and they are
nowhere near as bad as routinely portrayed here.

----
[*] Yes, I made that number up.



More information about the Python-list mailing list