re.findall() is skipping matching characters

Mon Oct 15 17:21:21 EDT 2001

Gustaf Liljegren wrote:
> But look what happens when I use the findall() function:
>
> >>> re.findall(r'<(a)', '<a href="page.html">')
> ['a']
>
> Why does findall() skip the '<'? I want to sort out full strings like '<a
> href="page.html">' or '<area ... href="page.html">' and put them in a list.
> I imagine the full regex should look something like this according to
> today's standards:
>
> re_link = re.compile(r'<(a|area)\s[^>]*href[^>]*/?>', re.I | re.M)
>
> Where's the problem?

from the re.findall documentation: "If one or more groups are
present in the pattern, return a list of groups; this will be a list
of tuples if the pattern has more than one group"

your pattern has one group, (a), so findall returns only the text
that group matched: 'a'.

try a non-capturing group, (?:x), in place of the capturing group (x).
with no capturing groups in the pattern, findall returns the full match.
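
a minimal sketch of the difference, using the pattern from the question. the
sample markup in the html variable is made up for illustration:

```python
import re

tag = '<a href="page.html">'

# one capturing group: findall() returns only what the group matched
print(re.findall(r'<(a)', tag))

# non-capturing group: no groups in the pattern, so findall() returns
# the whole match, including the '<'
print(re.findall(r'<(?:a)', tag))

# the pattern from the question, with the group made non-capturing
re_link = re.compile(r'<(?:a|area)\s[^>]*href[^>]*/?>', re.I | re.M)

# hypothetical sample input, not from the original post
html = '<p><a href="page.html">x</a> <area shape="rect" href="map.html"> <b>bold</b></p>'
links = re_link.findall(html)
print(links)
```

here links should hold the two full tags, '<a href="page.html">' and
'<area shape="rect" href="map.html">', and nothing for <p> or <b>.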

or better, use the right tool for the task: sgmllib

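import sgmllib

class Parser(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.hrefs = []  # href values, in document order
    def href(self, attrib):
        # attrib is a list of (name, value) pairs for the tag
        for k, v in attrib:
            if k == "href":
                self.hrefs.append(v)
                break
    # <a> and <area> start tags are dispatched to do_a and do_area,
    # so one handler serves both
    do_a = do_area = href

p = Parser()

p.feed("some html text")  # replace with the document to scan
p.close()

print p.hrefs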
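
for readers on Python 3, where sgmllib is no longer part of the standard
library, roughly the same idea can be sketched with html.parser.HTMLParser.
this swaps in a different module and is not the code above; the LinkParser
name and the sample markup are assumptions made for illustration:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []  # href values, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag in ("a", "area"):
            for k, v in attrs:
                if k == "href":
                    self.hrefs.append(v)
                    break

p = LinkParser()

# hypothetical sample input, not from the original post
p.feed('<a href="page.html">x</a><area shape="rect" href="map.html">')
p.close()

print(p.hrefs)
```

html.parser has no per-tag do_* methods, so the sketch checks the tag name
inside a single handle_starttag method instead.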

</F>

<!-- (the eff-bot guide to) the python standard library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->