Regex Match Problem

Garry Knight garryknight at gmx.net
Wed Mar 10 14:03:20 EST 2004


In message <8nhu4059js0lrc10ks9io6p4j70r7abj96 at 4ax.com>, David MacQuigg
wrote:

> On 10 Mar 2004 07:17:12 -0800, bradwiseathome at hotmail.com (bdwise)
> wrote:
> 
>>";([^.]*).focus()"

Just thought I'd expand on what you've done above and what Dave suggests
below as I remember what it was like when I was new to REs:

1) . means 'match any one character' (except a newline, by default).
2) ^ at the start of a character set [] means 'match anything except the
following set of characters'.
3) ( and ) are used to a) group expressions together, and b) capture the
text matched by parts of an expression.

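There's a wrinkle with 1) and 2), though: inside a character set, '.' loses
its special meaning and is just a literal dot. So '[^.]*' means "match 0 or
more characters that are not a '.'", which stops at the first dot and so
cannot span a dotted name like 'document.thisForm.textBox1'.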
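
To see how '.' behaves inside and outside a character set, here is a small
sketch you can paste into the interpreter. The sample string is only an
illustration based on the name in the original question:

```python
import re

text = "document.thisForm.textBox1.focus()"

# Outside a character set, '.' matches any one character (except a newline).
any_three = re.match(r"...", text).group()

# Inside a character set, '.' is just a literal dot, so '[^.]*' matches
# a run of characters that are not dots and stops at the first '.'.
up_to_dot = re.match(r"[^.]*", text).group()

print(any_three)  # doc
print(up_to_dot)  # document
```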

And 3) indicates that, as Dave said:

> You need to escape the metacharacters.  Try
> r";.*\.focus\(\).*;"

That's why he's suggested putting '\' in front of the '(' and ')' when you
want to literally search for those characters. 'Escaping' them in this way
means "match exactly this character, ignoring its usual meaning", where the
usual meaning of '(' and ')' would be "group the pattern between the '('
and ')' and store what it matches for later retrieval".

You'll also notice that Dave has escaped the '.' before 'focus' as you want
to match a literal '.' rather than have it mean "match any character".
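
If it helps, here is a rough sketch of the difference escaping makes. The
test strings are made up for illustration; re.escape() is a standard helper
that adds the backslashes for you when you start from a literal string:

```python
import re

text = "document.thisForm.textBox1.focus();"

# Unescaped: '.' matches any character and '()' is an empty group, so this
# pattern really means "any one character followed by 'focus'".
loose = re.search(r".focus()", "xfocus")

# Escaped: '.', '(' and ')' are all taken literally.
strict_miss = re.search(r"\.focus\(\)", "xfocus")
strict_hit = re.search(r"\.focus\(\)", text)

print(loose.group())       # xfocus
print(strict_miss)         # None
print(strict_hit.group())  # .focus()

# re.escape() builds the escaped pattern from the literal text.
escaped = re.escape(".focus()")
print(escaped)             # \.focus\(\)
```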

> Also, use a raw quote, so you don't have to escape the escapes.

Dave has put an 'r' at the front of the string. You may know that Python
interprets certain backslash sequences in a string. For example, '\n' in a
string gets turned into a newline character. It's almost always best to make
your REs r"raw strings" in this way so that Python leaves the backslashes
alone and the RE engine sees exactly what you typed.
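
A quick way to see why raw strings matter is the '\b' sequence, which is a
backspace in a normal Python string but a word boundary to the RE engine.
This is just a sketch; the '\b' pattern is my own example, not part of the
RE being discussed:

```python
import re

# In a normal string, Python turns '\b' into a backspace character
# before the re module ever sees it.
normal = "\bfocus\b"
raw = r"\bfocus\b"

print(len(normal))  # 7: backspace + 'focus' + backspace
print(len(raw))     # 9: both backslashes are kept

text = "textBox1.focus();"
normal_result = re.search(normal, text)
raw_result = re.search(raw, text)

print(normal_result)       # None
print(raw_result.group())  # focus
```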

> Don't forget to set re.DOTALL if you want the '.*' to capture newlines
> also.

By default, '.' matches any character except a newline, so a '.*' will not
run past the end of a line. If the text you're searching for will only ever
be wholly on a single line, you don't need to set re.DOTALL.
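
Here is a minimal sketch of what re.DOTALL changes, using a shortened
two-line sample that I made up for the purpose:

```python
import re

text = "a();\nb.focus();"
pattern = r";.*\.focus\(\)"

# Without re.DOTALL, '.' stops at the newline, so nothing matches.
no_flag = re.search(pattern, text)

# With re.DOTALL, '.' also matches the newline.
with_flag = re.search(pattern, text, re.DOTALL)

print(no_flag)                  # None
print(repr(with_flag.group()))  # ';\nb.focus()'
```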

Given everything above, Dave's RE:

r";.*\.focus\(\).*;"

means "match a single ';' followed by 0 or more of any character followed by
a literal '.' then the letters 'focus', a literal '(' and ')', followed by
0 or more of any character followed by a ';'". Which is more or less what
you want, but not exactly...

There is another thing you should be aware of: RE matching tends to be
'greedy'. That is, a '*' or '+' grabs as much text as it can, so if a string
that matches the next part of your pattern occurs more than once within the
search area, the match will run up to the rightmost occurrence. For example,
"a.*f" matched against the string "The big, bad wolf scared him off." will
match from the 'a' in 'bad' right up to the 2nd 'f' in 'off' even though
there are two other 'f's before it. Matches are greedy by default and they
can swallow up far more than you intended.
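
You can check the wolf example for yourself with something like this:

```python
import re

sentence = "The big, bad wolf scared him off."

# '.*' is greedy: it runs to the end of the string and then backs up
# only as far as the last 'f'.
greedy = re.search(r"a.*f", sentence).group()

print(greedy)  # ad wolf scared him off
```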

There is a way of making these repeat operators ('*' and '+') non-greedy,
which is by putting a '?' after them. So, you could alter Dave's RE slightly:

r";.*?\.focus\(\).*?;"

to make sure that the first occurrence of '.focus' is matched, followed by
the very next occurrence of ';'.
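
As a sketch of the difference, here is the non-greedy form tried on the same
wolf sentence used earlier:

```python
import re

sentence = "The big, bad wolf scared him off."

# '.*?' is non-greedy: it stops at the first 'f' it can reach.
lazy = re.search(r"a.*?f", sentence).group()

# Because each match is short, findall() can find more than one.
all_lazy = re.findall(r"a.*?f", sentence)

print(lazy)      # ad wolf
print(all_lazy)  # ['ad wolf', 'ared him of']
```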

Even the non-greedy version of Dave's RE will match from the first ';' it
finds right up to the ';' after 'focus()'. If you run it (with re.DOTALL
set, as the text spans two lines) on the text you gave in your example, it
will turn this:

something();something();
document.thisForm.textBox1.focus();something();

into this:

something()something();

In other words, it will match too much. If you think about it, what you want
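
Running both versions of Dave's RE over that text shows how much gets
swallowed. This is only a sketch, and it assumes re.DOTALL is set because
the sample spans two lines:

```python
import re

text = ("something();something();\n"
        "document.thisForm.textBox1.focus();something();")

# Dave's original (greedy) RE and the non-greedy variant.
greedy_result = re.sub(r";.*\.focus\(\).*;", "", text, flags=re.DOTALL)
lazy_result = re.sub(r";.*?\.focus\(\).*?;", "", text, flags=re.DOTALL)

print(repr(greedy_result))  # 'something()'
print(repr(lazy_result))    # 'something()something();'
```

Either way, the match starts at the first ';' in the text, so more than the
focus() statement is removed.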
to do is match from the character *after* a ';' up to the ';' following
'focus()'. A good RE for doing this is this:

r"[^;]*?\.focus\(\).*?;"

You'll see that it's a slight alteration to Dave's RE. Because '[^;]' can
never match a ';', the match can't start before the ';' that ends the
previous statement. So it picks up the run of non-';' characters leading up
to '.focus()', then everything up to and including the next ';' - and that's
what you want and that's *all* that you want.

So you can do something like this:

line = re.sub(r"[^;]*?\.focus\(\).*?;", "", linefromfile)

and line will either be the same as linefromfile if the RE didn't match, or
it will be linefromfile with the matching text ("anything.focus();")
deleted.
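
Put together, it might look something like the sketch below. The names
FOCUS_CALL and strip_focus_calls are my own invention for illustration, and
the sample assumes the statements all sit on one line:

```python
import re

# The RE from above, compiled once so it can be reused for every line.
FOCUS_CALL = re.compile(r"[^;]*?\.focus\(\).*?;")

def strip_focus_calls(line):
    """Return line with every 'anything.focus();' statement removed."""
    return FOCUS_CALL.sub("", line)

sample = ("something();something();"
          "document.thisForm.textBox1.focus();something();")

cleaned = strip_focus_calls(sample)
untouched = strip_focus_calls("nothing();")

print(cleaned)    # something();something();something();
print(untouched)  # nothing();
```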

I hope this helps and that I haven't "simplicated" things...  :o)

-- 
Garry Knight
garryknight at gmx.net  ICQ 126351135
Linux registered user 182025


