Trouble with regular expressions

John Machin sjmachin at lexicon.net
Sat Feb 7 18:48:02 EST 2009


On Feb 8, 10:15 am, MRAB <goo... at mrabarnett.plus.com> wrote:
> John Machin wrote:
> > On Feb 8, 1:37 am, MRAB <goo... at mrabarnett.plus.com> wrote:
> >> LaundroMat wrote:
> >>> Hi,
> >>> I'm quite new to regular expressions, and I wonder if anyone here
> >>> could help me out.
> >>> I'm looking to split strings that ideally look like this: "Update: New
> >>> item (Household)" into a group.
> >>> This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
> >>> ("Update", "New item", "(Household)")
> >>> Some strings will look like this however: "Update: New item (item)
> >>> (Household)". The expression above still does its job, as it returns
> >>> ("Update", "New item (item)", "(Household)").
>
> > Not quite true; it actually returns
> >     ('Update:', ' New item (item) ', '(Household)')
> > However ignoring the difference in whitespace, the OP's intention is
> > clear. Yours returns
> >     ('Update:', ' New item ', '(item) (Household)')
>
> The OP said it works OK, which I took to mean that the OP was OK with
> the extra whitespace, which can be easily stripped off. Close enough!

As I said, the whitespace difference [between what the OP said his
regex did and what it actually does] is not the problem. The problem
is that the OP's "works OK" included (item) in the 2nd group, whereas
yours includes (item) in the 3rd group.
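
For anyone who wants to check the grouping difference, here is a minimal sketch that runs both patterns from this thread over the OP's two sample strings. The names op_rx, mrab_rx, two_parens and no_parens are mine, chosen for illustration only.

```python
import re

# The two patterns quoted in this thread.
op_rx = re.compile(r'^(Update:)?(.*)(\(.*\))$')            # the OP's original
mrab_rx = re.compile(r'^(Update:)?(.*?)(?:(\(.*\)))?$')    # MRAB's suggestion

two_parens = "Update: New item (item) (Household)"
no_parens = "Update: new item"

# Greedy middle group: "(item)" stays in group 2.
print(op_rx.match(two_parens).groups())
# Lazy middle group: "(item)" moves into group 3.
print(mrab_rx.match(two_parens).groups())

# With no parenthesised text, the OP's pattern does not match at all,
# while the optional last group lets MRAB's pattern match with None.
print(op_rx.match(no_parens))
print(mrab_rx.match(no_parens).groups())
```

The first two calls show the disagreement described above: the same input, but (item) lands in a different group.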

>
> >>> It does not work however when there is no text in parentheses (eg
> >>> "Update: new item"). How can I get the expression to return a tuple
> >>> such as ("Update:", "new item", None)?
> >> You need to make the last group optional and also make the middle group
> >> lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.
>
> > Why do you perpetuate the redundant ^ anchor?
>
> The OP didn't say whether search() or match() was being used. With the ^
> it doesn't matter.

It *does* matter. With a front-anchored pattern, re.search() is
suboptimal: after failing at the first position, it's not smart enough
to give up, so it retries at every later position and the cost grows
with the length of the text.

[win32, 2.6.1]
C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
100000 loops, best of 3: 4.37 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
10000 loops, best of 3: 34.1 usec per loop

Corresponding figures for 3.0 are 1.02, 1.02, 3.99, and 32.9 usec per
loop.
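
The same comparison can be run from a script instead of the command line. This is only a sketch: the helper name usec_per_call and the loop counts are my own choices, and the numbers will differ on other machines and Python versions (a later release may well handle the anchored search() case differently), so no expected timings are shown.

```python
import re
import timeit

rx = re.compile('^frobozz')

def usec_per_call(method, txt, number=2000, repeat=3):
    """Return the best-of-`repeat` time per call, in microseconds."""
    best = min(timeit.repeat(lambda: method(txt), number=number, repeat=repeat))
    return best / number * 1e6

for n in (100, 1000):
    txt = n * 'x'
    # Neither call finds a match; only the time taken to fail is of interest.
    assert rx.match(txt) is None
    assert rx.search(txt) is None
    print("len=%4d  match: %6.2f usec  search: %6.2f usec" % (
        n, usec_per_call(rx.match, txt), usec_per_call(rx.search, txt)))
```

If search() time rises with the length of the text while match() time stays flat, that is the behaviour shown in the figures above.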



