Reg Exp: Need advice concerning "greediness"

Calvelo Daniel dcalvelo at pharion.univ-lille2.fr
Mon Oct 2 06:17:39 EDT 2000


Franz GEIGER <fgeiger at datec.at> wrote:
: Hello all,

: I want to exchange font colors of headings of a certain level in HTML files.

: I have a line containing a heading level 1, e.g.: <h1><font
: COLOR="#FF0000">Heading Level 1</font></h1>.

: Now I want to split this into 3 groups: Everything before "COLOR=xyz",
: "COLOR=xyz" itself, and everything after "COLOR=xyz".

: I tried:
: sRslt = "<h1><font COLOR="#FF0000">Heading Level 1</font></h1>";
: print re.findall(re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', re.I |
: re.S), sRslt);

Beware of quotes in your example:

>>> sRslt = "<h1><font COLOR="#FF0000">Heading Level 1</font></h1>"
>>> sRslt 
'<h1><font COLOR='

(That explains the weird results reported here: the second double quote
ends the string, and the '#' starts a comment.)
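
A minimal sketch of the quoting fix, assuming Python 3 print() syntax. The
names s_truncated and s_fixed are mine, not from the original post. Wrapping
the literal in single quotes keeps the inner double quotes as part of the
string:

```python
# The original assignment: the string ends at the second double quote,
# and everything from '#' onward is parsed as a comment.
s_truncated = "<h1><font COLOR="#FF0000">Heading Level 1</font></h1>"

# Single quotes on the outside leave the inner double quotes literal.
s_fixed = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'

print(repr(s_truncated))  # only the prefix up to COLOR= survives
print(repr(s_fixed))      # the whole heading, attribute value included
```

Escaping the inner quotes with backslashes, or using a triple-quoted string,
would work just as well.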

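As for your regexp, once the string is quoted correctly, the following works:

>>> import re
>>> sRslt = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'
>>> re.findall(re.compile(r'(.*?FONT[^">]+?)(COLOR=.*?)?([ |>].*)', re.I | re.S), sRslt)
[('<h1><font ', 'COLOR="#FF0000"', '>Heading Level 1</font></h1>')]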

I used a negated character class to force the first group to end just before
a possible COLOR attribute.  Otherwise, what I think is happening is that your
non-greedy search is indeed non-greedy: it stops right after 'font', and
'(COLOR=.*?)*' is then satisfied by a null match, so the COLOR attribute ends
up in the third group. BTW, I changed that '*' to '?', which is what you
meant, if I read correctly.
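
To see the difference side by side, here is a small sketch, assuming Python 3
and a correctly quoted input. The names html, original and revised are mine,
for illustration only:

```python
import re

html = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'
flags = re.I | re.S

# Pattern from the question: nothing forces group 1 to reach COLOR.
original = re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', flags)

# Revised pattern: [^">]+? must consume at least one character after
# 'font', and '?' makes the COLOR group optional instead of repeated.
revised = re.compile(r'(.*?FONT[^">]+?)(COLOR=.*?)?([ |>].*)', flags)

orig_result = original.findall(html)
rev_result = revised.findall(html)

print(orig_result)  # COLOR group is empty; the attribute lands in group 3
print(rev_result)   # three groups: before, COLOR=..., after
```

With the original pattern, the space after 'font' already satisfies '[ |>]',
so the engine never needs to try the COLOR group at all.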

HTH, DCA

-- Daniel Calvelo Aros
     calvelo at lifl.fr


