Whittle it on down

Steven D'Aprano steve at pearwood.info
Thu May 5 14:14:05 EDT 2016


On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote:

> Steven D'Aprano writes:
> 
>> I get something like this:
>>
>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>>
>>
>> but it fails on strings like "AA   &  A &  A". What am I doing wrong?
> 
> It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
> when the middle part is just one LETTER. That's something of a
> misanalysis anyway. I notice that the correct pattern has already been
> posted at least thrice and you have acknowledged one of them.

Thrice? I've seen Peter's response (he made the trivial and obvious
simplification of just using A instead of [A-Z], but that was easy to
understand), and Random832 almost got it, missing only that you need to
match the entire string, not just a substring. If there was a third
response, I missed it.
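
For anyone following along, here is a minimal sketch of why my pattern
fails and of one pattern that handles the failing case. It is only an
illustration under my own assumptions, not necessarily the pattern that
Peter or Random832 posted; the names ORIGINAL, REVISED and the two helper
functions are made up for this example.

```python
import re

# The pattern from my earlier message. Each repetition of the group must
# consume a whole "LETTERS & LETTERS" pair, so a lone middle part breaks it.
ORIGINAL = re.compile(r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)")

# Hypothetical revision: one run of letters, then zero or more
# "& LETTERS" continuations, each with optional spaces around the "&".
REVISED = re.compile(r"[A-Z]+(?: *& *[A-Z]+)*")


def matches_original(s):
    """True if the original (anchored) pattern accepts the string."""
    return ORIGINAL.match(s) is not None


def matches_revised(s):
    """True if the revised pattern matches the ENTIRE string."""
    # fullmatch, not search: a substring match is not enough here.
    return REVISED.fullmatch(s) is not None


if __name__ == "__main__":
    for s in ["AA", "AA & A", "AA   &  A &  A", "AA & ", "xx AA & A"]:
        print(repr(s), matches_original(s), matches_revised(s))
```

Using search() instead of fullmatch() with the revised pattern would
accept "xx AA & A", because the substring "AA & A" matches; that is the
"entire string, not just a substring" point above.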


> But I think you are also trying to do too much with a single regex. A
> more promising start is to think of the whole string as "parts" joined
> with "glue", then split with a glue pattern and test the parts:
> 
> import re
> glue = re.compile(" *& *| +")
> keep, drop = [], []
> for datum in data:
>     items = glue.split(datum)
>     if all(map(str.isupper, items)):
>         keep.append(datum)
>     else:
>         drop.append(datum)

Ah, the penny drops! For a while I thought you were suggesting using this to
assemble a regex, and it just wasn't making sense to me. Then I realised
you were using this as a matcher: feed in the list of strings, and it
sorts them into strings to keep and strings to discard. Nicely done; that is
a good technique to remember.
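
A self-contained sketch of that split-and-test idea, wrapped in a function
so it can be run as is. The function name partition and the sample data
are my own assumptions for illustration; the glue pattern and the
str.isupper test are taken from the quoted code.

```python
import re

# Glue is either an ampersand with optional surrounding spaces,
# or a run of one or more spaces.
GLUE = re.compile(" *& *| +")


def partition(data):
    """Sort strings into (keep, drop) by splitting on glue and testing parts."""
    keep, drop = [], []
    for datum in data:
        items = GLUE.split(datum)
        # Caveat: str.isupper() is True for "A1" as well, since it only
        # checks the cased characters. An empty part (for example from a
        # leading "&") gives False, so such strings are dropped.
        if all(map(str.isupper, items)):
            keep.append(datum)
        else:
            drop.append(datum)
    return keep, drop


if __name__ == "__main__":
    # Hypothetical sample data, not from the original thread.
    sample = ["AA   &  A &  A", "AA & b", "A B", "& A"]
    keep, drop = partition(sample)
    print("keep:", keep)
    print("drop:", drop)
```

Note that this glue pattern also treats plain spaces as glue, so "A B" is
kept, which the single-regex approach would reject. Whether that is wanted
depends on the data.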

Thanks for the analysis!



-- 
Steven



