Reg Exp: Need advice concerning "greediness"
Calvelo Daniel
dcalvelo at pharion.univ-lille2.fr
Mon Oct 2 06:17:39 EDT 2000
Franz GEIGER <fgeiger at datec.at> wrote:
: Hello all,
: I want to exchange font colors of headings of a certain level in HTML files.
: I have a line containing a heading level 1, e.g.: <h1><font
: COLOR="#FF0000">Heading Level 1</font></h1>.
: Now I want to split this into 3 groups: Everything before "COLOR=xyz",
: "COLOR=xyz" itself, and everything after "COLOR=xyz".
: I tried:
: sRslt = "<h1><font COLOR="#FF0000">Heading Level 1</font></h1>";
: print re.findall(re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', re.I |
: re.S), sRslt);
Beware of quotes in your example:
>>> sRslt = "<h1><font COLOR="#FF0000">Heading Level 1</font></h1>"
>>> sRslt
'<h1><font COLOR='
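(That explains the weird results reported here: the second double quote ends
the string literal, and everything from '#' onwards is read as a comment.)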
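A minimal sketch of one way around the quoting problem, assuming the intended value is the whole HTML line: delimit the string with single quotes so the inner double quotes stay literal. The variable name is taken from the original post.

```python
# Single quotes on the outside keep the double quotes around the
# color value as ordinary characters instead of ending the string.
sRslt = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'
print(sRslt)
```

Escaping the inner quotes with a backslash would do the same job; single quotes are just easier to read here.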
As for your regexp, the following works:
>>> print re.findall(re.compile(r'(.*?FONT[^">]+?)(COLOR=.*?)?([ |>].*)', re.I | re.S), sRslt);
[('<h1><font ', 'COLOR="#FF0000"', '>Heading Level 1</font></h1>')]
I used a negated character class to force the first group to end before
a possible COLOR attribute. Otherwise, what I think is happening is that your
non-greedy search is indeed non-greedy, but '(COLOR=.*?)*' is allowed to match
nothing: the first group stops right after FONT, the COLOR group matches zero
times, and the third group swallows everything from the following space on.
BTW, I changed that '*' to '?', which is what you meant, if I read correctly.
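To make the difference visible, here is a small side-by-side sketch of the two patterns on the same (correctly quoted) line. The names `original`, `fixed`, `original_groups` and `fixed_groups` are only illustrative, and it is written with print() as a function so it also runs on newer Pythons.

```python
import re

s = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'
flags = re.I | re.S

# Pattern from the question: the COLOR group may repeat zero times.
original = re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', flags)
# Pattern from this reply: the negated class makes group 1 take at
# least one more character after FONT, and the COLOR group is optional.
fixed = re.compile(r'(.*?FONT[^">]+?)(COLOR=.*?)?([ |>].*)', flags)

original_groups = original.findall(s)
fixed_groups = fixed.findall(s)

print(original_groups)  # middle group comes back empty
print(fixed_groups)     # middle group holds COLOR="#FF0000"
```

With the original pattern the middle group is empty and the COLOR attribute ends up in the third group. As a side note, inside a character class '|' is a literal pipe, so '[ |>]' also matches '|'; '[ >]' is probably what was intended.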
HTH, DCA
-- Daniel Calvelo Aros
calvelo at lifl.fr