Whittle it on down

Steven D'Aprano steve at pearwood.info
Thu May 5 14:03:23 EDT 2016


On Thu, 5 May 2016 11:21 pm, Random832 wrote:

> On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote:
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
> 
> Well, obviously *your* language (not the OP's), given the cases you
> reject, is "one or more sequences of letters separated by
> space*-ampersand-space*", and that is actually one of the easiest kinds
> of regex to write: "[A-Z]+( *& *[A-Z]+)*".

One of the easiest kinds of regex to write incorrectly:

py> re.match("[A-Z]+( *& *[A-Z]+)*", "A----")
<_sre.SRE_Match object at 0xb7bf4aa0>


It doesn't even get the "all uppercase" part of the specification:

py> re.match("[A-Z]+( *& *[A-Z]+)*", "Azzz")
<_sre.SRE_Match object at 0xb7bf4aa0>

You failed to anchor the regex at the beginning and end of the string, an
easy mistake to make, but that's the point. It's easy to make mistakes with
regexes because the syntax is so terse and unforgiving.
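
To make the anchoring point concrete, here is a minimal sketch, assuming
Python 3.4 or later (where re.fullmatch is available). The pattern is the
one quoted above, unchanged; the helper name is_valid is made up for
illustration. re.match only pins the start of the string, so the calls
above succeed by matching just the leading "A" and ignoring the rest.

```python
import re

# The pattern exactly as quoted above; only the way it is applied changes.
PATTERN = "[A-Z]+( *& *[A-Z]+)*"


def is_valid(text):
    """Hypothetical helper: True only if the *whole* string fits the pattern."""
    # re.fullmatch requires the match to span the entire string, which is
    # equivalent to anchoring the pattern at both ends.
    return re.fullmatch(PATTERN, text) is not None


# What the unanchored re.match actually matched in the two cases above:
print(re.match(PATTERN, "A----").group())  # A
print(re.match(PATTERN, "Azzz").group())   # A

# With the whole string required to match, the junk is rejected:
print(is_valid("A----"))      # False
print(is_valid("Azzz"))       # False
print(is_valid("AAA BBB"))    # False
print(is_valid("AAA & BBB"))  # True
```

On versions without re.fullmatch, appending "$" to the pattern and calling
re.match comes close, although "$" also matches just before a trailing
newline.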

But I think I just learned something important today. I learned that it's
not actually regexes that I dislike, it's regex culture that I dislike.
What I learned from this thread:


- Nobody could possibly want to support non-ASCII text. (Apart from the
approximately 6.5 billion people in the world that don't speak English of
course, an utterly insignificant majority.)

- Data validity doesn't matter, because there's no possible way that you
might accidentally scrape data from the wrong part of an HTML file and end
up with junk input.

- Even if you do somehow end up with junk, there couldn't possibly be any
real consequences to that.

- It doesn't matter if you match too much, or too little, that just means the
specs are too pedantic.


Hence the famous quote:

    Some people, when confronted with a problem, think 
    "I know, I'll use regular expressions." Now they 
    have two problems.


It's not really regexes that are the problem.


> However, your spec is wrong:

How can you say that? It's *my* spec, I can specify anything I want.


>> - Leading or trailing spaces, or spaces not surrounding an ampersand,
>> must not match: "AAA BBB" must be rejected.
> 
> The *very first* item in OP's list of good outputs is 'PHYSICAL FITNESS
> CONSULTANTS & TRAINERS'.

That's very nice, but irrelevant. I'm not talking about the OP's outputs.
I'm giving my own.




-- 
Steven



