Whittle it on down

Thu May 5 13:49:26 EDT 2016

Steven D'Aprano writes:

> I get something like this:
>
> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>
>
> but it fails on strings like "AA   &  A &  A". What am I doing wrong?

It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
when the middle part is just one LETTER: that letter would have to end
the first group and also begin the second. That's something of a
misanalysis anyway. I notice that the correct pattern has already been
posted at least thrice and you have acknowledged one of them.
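
To see the failure concretely, here is a small sketch with the standard
re module. The first pattern is the quoted one. The second is only an
assumed shape for a working single regex (one part, then any number of
ampersand-joined parts); it is not necessarily any of the patterns that
were posted in the thread.

```python
import re

# The quoted pattern: every repetition must itself be LETTERS & LETTERS.
quoted = re.compile(r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)")

# Assumed alternative: one part, then zero or more "& part" tails.
chained = re.compile(r"[A-Z]+( *& *[A-Z]+)*")

sample = "AA   &  A &  A"
quoted_ok = quoted.match(sample) is not None
chained_ok = chained.fullmatch(sample) is not None
print(quoted_ok, chained_ok)
```

Here quoted_ok comes out False and chained_ok comes out True, because
the single middle letter only needs to belong to one tail.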

But I think you are also trying to do too much with a single regex. A
more promising start is to think of the whole string as "parts" joined
with "glue", then split with a glue pattern and test the parts:

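import re
# Glue is an ampersand with optional spaces around it, or just spaces.
glue = re.compile(" *& *| +")
keep, drop = [], []
for datum in data:
    items = glue.split(datum)
    # str.isupper is false for "", so stray glue gets caught, too.
    if all(map(str.isupper, items)):
        keep.append(datum)
    else:
        drop.append(datum)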

That will cope with Greek, by the way.
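
For a quick trial, the same loop can be run on a few made-up strings.
The sample data below is hypothetical, not from the thread; it is only
meant to show one Greek entry passing and two kinds of failure.

```python
import re

glue = re.compile(" *& *| +")

# Hypothetical sample data, invented for this sketch.
data = ["AA   &  A &  A", "ΑΘΗΝΑ & ΡΩΜΗ", "Mixed & CASE", "A && B"]

keep, drop = [], []
for datum in data:
    items = glue.split(datum)
    # A doubled ampersand leaves an empty item, and "".isupper() is False.
    if all(map(str.isupper, items)):
        keep.append(datum)
    else:
        drop.append(datum)

print(keep)
print(drop)
```

The first two strings end up in keep, the last two in drop.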

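It's annoying that the order of the branches of the glue pattern above
matters: the usual regex engines take the first branch that matches at
a position, not the longest, so the ampersand branch has to come before
the bare-space branch. One _does_ have problems when one uses the usual
regex engines.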
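
A minimal sketch of the difference, with the standard re module. The
names good and bad are just labels for this illustration.

```python
import re

good = re.compile(" *& *| +")   # ampersand branch is tried first
bad = re.compile(" +| *& *")    # bare-space branch is tried first

good_items = good.split("A & B")
bad_items = bad.split("A & B")

print(good_items)
print(bad_items)
```

With the branches swapped, the space before the ampersand is consumed
on its own, the ampersand is then matched as a second piece of glue, and
an empty item appears between the two.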

Capturing groups in the glue pattern would produce glue items in the
split output. Either avoid them or deal with them: one could split with
the underspecific "([ &]+)" and then check that each glue item contains
at most one ampersand. One could also allow other punctuation, and then
check afterwards.
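
The capturing variant might look something like this. It is a sketch
only; the function name acceptable and the pattern name loose_glue are
made up for the illustration.

```python
import re

# The capturing group makes split() return the glue items as well.
loose_glue = re.compile("([ &]+)")

def acceptable(datum):
    items = loose_glue.split(datum)
    # With one capturing group, parts and glue alternate in the output.
    parts, glues = items[0::2], items[1::2]
    return (all(map(str.isupper, parts))
            and all(g.count("&") <= 1 for g in glues))

print(acceptable("AA   &  A &  A"))
print(acceptable("A && B"))
```

The first call accepts the string; the second rejects it because one
glue item contains two ampersands.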

One can use _another_ regex to test individual parts. The code above
used str.isupper to test a part. The improved regex package (from PyPI,
to cope with Greek) can do the same:

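import regex
part = regex.compile("[[:upper:]]+")
# Relaxed glue: its second branch also matches the empty string.
glue = regex.compile(" *& *| *")
# For splitting, insist on at least one space, since a split on empty
# matches (as re.split does from Python 3.7 on) would cut parts apart.
splitter = regex.compile(" *& *| +")

keep, drop = [], []
for datum in data:
    items = splitter.split(datum)
    if all(map(part.fullmatch, items)):
        keep.append(datum)
    else:
        drop.append(datum)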

Just "[A-Z]+" suffices for ASCII letters, and "[A-ZÄÖ]+" copes with most
of Finnish; the [:upper:] class is nicer and there's much more that is
nicer in the newer regex package.

The point of using a regex for this is that the part pattern can then be
generalized to allow some punctuation or digits in a part, for example.
Anything that the glue pattern doesn't consume. (Nothing wrong with
using other techniques for this, either; str.isupper worked nicely
above.)
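
As a sketch of such a generalization, here is a part pattern that also
admits digits, dots and hyphens after an initial capital. The character
set is an arbitrary assumption for the illustration, and the standard
re module with plain ASCII ranges is used so that the example runs
without the regex package.

```python
import re

glue = re.compile(" *& *| +")

# Hypothetical part pattern: a capital letter, then capitals, digits,
# dots or hyphens. Nothing here is consumed by the glue pattern.
part = re.compile(r"[A-Z][A-Z0-9.\-]*")

def acceptable(datum):
    return all(map(part.fullmatch, glue.split(datum)))

print(acceptable("R2D2 & C-3PO"))
print(acceptable("A.B. & c"))
```

The first string is accepted, the second is rejected because of the
lowercase part.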

It's also possible to swap the roles of the patterns. Split with a part
pattern. Then check that the text between such parts is glue:

keep, drop = [], []
for datum in data:
    items = part.split(datum)
    if all(map(glue.fullmatch, items)):
        keep.append(datum)
    else:
        drop.append(datum)
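
A runnable sketch of the swapped roles, using the standard re module
and an ASCII part pattern as stand-ins. Note that splitting with the
part pattern leaves an empty string at each end where a part touches the
edge of the string, so here the glue pattern needs a branch that matches
the empty string. The name acceptable is made up for the illustration.

```python
import re

part = re.compile("[A-Z]+")
glue = re.compile(" *& *| *")   # the second branch also matches ""

def acceptable(datum):
    # What is left between (and around) the parts must all be glue.
    gaps = part.split(datum)
    return all(map(glue.fullmatch, gaps))

print(acceptable("AA   &  A &  A"))
print(acceptable("A && B"))
```

This version is more relaxed than the first: it also accepts a string
with no parts at all, such as an empty string, so a further test is
needed if at least one part is required.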

The point is to keep the patterns simple by making them more local, or
more relaxed, followed by a further test. This way they can be made to
do more, but not more than they reasonably can.

Note also the use of re.fullmatch (new in Python 3.4) instead of
re.match (let alone re.search) when a full match is required! This gets
rid of all anchors in the pattern, which may in turn allow fewer
parentheses inside the pattern.
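
The three methods can be compared on one anchor-free pattern. This is
just a small illustration with the standard re module.

```python
import re

pattern = re.compile("[A-Z]+")   # no anchors anywhere

# match only needs a matching prefix, search a match anywhere.
prefix_hit = pattern.match("ABc") is not None
anywhere_hit = pattern.search("abC") is not None

# fullmatch requires the whole string to match.
full_miss = pattern.fullmatch("ABc") is not None
full_hit = pattern.fullmatch("ABC") is not None

print(prefix_hit, anywhere_hit, full_miss, full_hit)
```

Only fullmatch rejects "ABc"; match is content with the "AB" prefix.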

The usual regex engines are not perfect, but parts of them are
fantastic.