Regular expressions

Christian Gollwitzer auriocus at gmx.de
Thu Nov 5 03:18:58 EST 2015


Am 05.11.15 um 06:59 schrieb rurpy at yahoo.com:
>> Can you call yourself a well-rounded programmer without at least a basic
>> understanding of some regex library? Well, probably not. But that's part of
>> the problem with regexes. They have, to some degree, driven out potentially
>> better -- or at least differently bad -- pattern matching solutions, such
>> as (E)BNF grammars, SNOBOL pattern matching, or lowly globbing patterns. Or
>> even alternative idioms, like Hypercard's "chunking" idioms.
>
> Hmm, very good point.  I wonder why all those "potentially better"
> solutions have not been more widely adopted?  A conspiracy by a
> secret regex cabal?

I'm mostly on the pro-side of the regex discussion, but this IS a valid 
point. regexes are not always a good way to express a pattern, even if 
the pattern is regular. The point is, that you can't build them up 
easily piece-by-piece. Say, you want a regex like "first an 
international phone number, then a name, then a second phone number" - 
you will have to *repeat* the pattern for phone number twice. In more 
complex cases this can become a nightmare, like the monster that was 
mentioned before to validate an email.

A better alternative, then, is PEG for example. You can easily write

pattern <- phone_number name phone_number
phone_number <- '+' [0-9]+ ( '-' [0-9]+ )*
name <-  [[:alpha:]]+

or something similar using a PEG parser. It has almost the same 
quantifiers as a Regex, is much more readable, runs in linear time over 
all inputs and can parse languages with the approximately the same 
complexity as the Knuth style parsers (LR(k) etc.), but without 
ambiguity. I'm really astonished that PEG parsing is not better 
supported in the world of computing, instead most people choose to stick 
to the lexer+scanner combination

Finally, an anecdote from my "early" life of computing. In 1990, when I 
was 12 years old, I participated in an annual competition of computer 
science for high school students. I was learning how to program without 
formal training, and solved one problem where a grammar was depicted as 
a flowchart and the task was to write parser for it, to check the 
validity of input strings. The grammar is depicted here (problem 1):

http://www.auriocus.de/StringKurs/RegEx/uebungen1.pdf

As a 12 year old, not knowing anything about pattern recognition, but 
thinking I was the king, as is usual for boys in that age, I sat down 
and manually constructed a recursive descent parser in a BASIC like 
language. It had 1000 lines and took me a few weeks to get it correct. 
Finally the solution was accepted as working, but my participation was 
rejected because the solutions lacked documentation. 16 years later I 
used the problem for a course on string processing (that's what the PDF 
is for), and asked the students to solve it using regexes. My own 
solution consists of 67 characters, and it took me5 minutes to write it 
down.

Admittedly, this problem is constructed, but solving similar tasks by 
regexes is still something that I need to do on a daily basis, when I 
get data from other scientists in odd formats and I need to preprocess 
them. I know people who use a spreadsheet and copy/paste millions of 
datapoints manually becasue they lack the knowledge of using such tools.

	Christian




More information about the Python-list mailing list