[Tutor] Regular expression - I

Tue Feb 18 20:57:20 CET 2014

_____________________________
> From: Steve Willoughby <steve at alchemy.com>
>To: Santosh Kumar <rhce.san at gmail.com> 
>Cc: python mail list <tutor at python.org> 
>Sent: Tuesday, February 18, 2014 7:03 PM
>Subject: Re: [Tutor] Regular expression - I
> 
>
>Because the regular expression <H*> means “match an angle-bracket character, zero or more H characters, followed by a close angle-bracket character” and your string does not match that pattern.
>
>This is why it’s best to check that the match succeeded before going ahead to call group() on the result (since in this case there is no result).
>
>
>On 18-Feb-2014, at 09:52, Santosh Kumar <rhce.san at gmail.com> wrote:

You might also want to consider making it a non-greedy match. The regex HOWTO at http://docs.python.org/2/howto/regex.html covers an example almost identical to yours:

Greedy versus Non-Greedy
When repeating a regular expression, as in a*, the resulting action is to
consume as much of the pattern as possible.  This fact often bites you when
you’re trying to match a pair of balanced delimiters, such as the angle brackets
surrounding an HTML tag.  The naive pattern for matching a single HTML tag
doesn’t work because of the greedy nature of .*.
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
<html><head><title>Title</title>
The RE matches the '<' in <html>, and the .* consumes the rest of
the string.  There’s still more left in the RE, though, and the > can’t
match at the end of the string, so the regular expression engine has to
backtrack character by character until it finds a match for the >.   The
final match extends from the '<' in <html> to the '>' in </title>, which isn’t what you want.

In this case, the solution is to use the non-greedy qualifiers *?, +?, ??, or {m,n}?, which match as little text as possible.  In the above
example, the '>' is tried immediately after the first '<' matches, and
when it fails, the engine advances a character at a time, retrying the '>' at every step.  This produces just the right result:
>>> print re.match('<.*?>', s).group()
<html>
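
Putting the two pieces of advice in this thread together (check that the match succeeded, and prefer the non-greedy form), here is a minimal sketch. The helper name first_match is made up for illustration, and the test string is the one from the HOWTO excerpt, not the string from the original question. It is written with print() calls so it runs on Python 3 as well.

```python
import re

# Sample string taken from the HOWTO excerpt above.
s = '<html><head><title>Title</title>'

def first_match(pattern, text):
    """Hypothetical helper: return the matched text, or None if there is no match."""
    m = re.match(pattern, text)
    # re.match returns None on failure, so guard before calling group().
    return m.group() if m is not None else None

print(first_match('<H*>', s))    # no match: '<' must be followed by H's and then '>'
print(first_match('<.*>', s))    # greedy: runs to the last '>' in the string
print(first_match('<.*?>', s))   # non-greedy: stops at the first '>'
```

With the guard in place, a pattern that does not match gives back None instead of raising AttributeError on the group() call.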