re.findall() is skipping matching characters

Mon Oct 15 17:51:22 EDT 2001

On Mon, 15 Oct 2001, Gustaf Liljegren wrote:

> Thanks for helping me out with matching/searching before. Unfortunately,
> the example I gave was a little too basic, so I need some more help.
>
> >>> re.search(r'<(a)', '<a href="page.html">').group()
> '<a'
>
> The search() function matches the full expression: both the '<' and the
> '(a)', which is short for an alternation between more HTML elements. The
> match() function behaves like this too:
>
> >>> re.match(r'<(a)', '<a href="page.html">').group()
> '<a'
>
> But look what happens when I use the findall() function:
>
> >>> re.findall(r'<(a)', '<a href="page.html">')
> ['a']
>
> Why does findall() skip the '<'? I want to sort out full strings like '<a
> href="page.html">' or '<area ... href="page.html">' and put them in a list.
> I imagine the full regex should look something like this according to
> today's standards:
>
> re_link = re.compile(r'<(a|area)\s[^>]*href[^>]*/?>', re.I | re.M)
>
> Where's the problem?

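It's because <match>.group() takes an optional parameter specifying which
subgroup to return, defaulting to 0, which specifies the entire match. Pass a
1 instead and you get the 'a' that findall() gives you: when the pattern
contains a group, findall() returns the group's contents rather than the
whole match.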
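
A minimal sketch of the difference, using the strings from the question. The
variable names and the second test tag are made up for illustration, and
switching to a non-capturing group (?:...) is one possible way to get whole
tags back from findall(), not necessarily the only one:

```python
import re

tag = '<a href="page.html">'

# group() with no argument is group(0): the entire match.
m = re.search(r'<(a)', tag)
whole = m.group(0)    # '<a'
inner = m.group(1)    # 'a', the contents of the first group

# With one capturing group in the pattern, findall() returns
# the group's contents instead of the whole match.
with_group = re.findall(r'<(a)', tag)         # ['a']

# A non-capturing group (?:...) still groups the alternation,
# but findall() goes back to returning the whole match.
without_group = re.findall(r'<(?:a)', tag)    # ['<a']

# The regex from the question, with the group made non-capturing.
# The sample text below is hypothetical.
re_link = re.compile(r'<(?:a|area)\s[^>]*href[^>]*/?>', re.I | re.M)
text = '<a href="page.html"> text <area shape="rect" href="x.html">'
links = re_link.findall(text)
```

Here links should end up holding the two full tags as strings, which is the
list the question asks for.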

-- 
Ignacio Vazquez-Abrams  <ignacio at openservices.net>

   "As far as I can tell / It doesn't matter who you are /
    If you can believe there's something worth fighting for."
       - "Parade", Garbage