re.findall() is skipping matching characters
Fredrik Lundh
fredrik at pythonware.com
Mon Oct 15 17:21:21 EDT 2001
Gustaf Liljegren wrote:
> But look what happens when I use the findall() function:
>
> >>> re.findall(r'<(a)', '<a href="page.html">')
> ['a']
>
> Why does findall() skip the '<'? I want to sort out full strings like '<a
> href="page.html">' or '<area ... href="page.html">' and put them in a list.
> I imagine the full regex should look something like this according to
> today's standards:
>
> re_link = re.compile(r'<(a|area)\s[^>]*href[^>]*/?>', re.I | re.M)
>
> Where's the problem?
from the re.findall documentation: "If one or more groups are
present in the pattern, return a list of groups; this will be a list
of tuples if the pattern has more than one group"
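The three cases that sentence covers can be seen side by side. This is a small sketch in Python 3 syntax that reuses the tag from the question; the variable names are illustrative only.

```python
import re

text = '<a href="page.html">'

# no groups: findall returns the full text of each match
no_group = re.findall(r'<a', text)

# one group: findall returns only what the group captured
one_group = re.findall(r'<(a)', text)

# two groups: findall returns one tuple per match
two_groups = re.findall(r'(<)(a)', text)

print(no_group)    # ['<a']
print(one_group)   # ['a']
print(two_groups)  # [('<', 'a')]
```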
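in your pattern, (a) is a group, so findall() returns only the text
captured by that group, not the whole match. to get the whole match,
use a non-capturing group: (?:x) instead of (x)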
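As a sketch of that change applied to the pattern from the question (Python 3 syntax; the sample HTML string and the names `html`, `links`, and `links_via_iter` are made up for illustration):

```python
import re

# the pattern from the question, with (a|area) changed to (?:a|area)
re_link = re.compile(r'<(?:a|area)\s[^>]*href[^>]*/?>', re.I | re.M)

# hypothetical sample input
html = '<p><a href="page.html">x</a> <AREA shape="rect" href="map.html"> <b>'

# with no capturing groups left, findall returns the whole matches
links = re_link.findall(html)
print(links)

# alternative that works even if the capturing group is kept:
# iterate over match objects and take group(0), the whole match
re_grouped = re.compile(r'<(a|area)\s[^>]*href[^>]*/?>', re.I | re.M)
links_via_iter = [m.group(0) for m in re_grouped.finditer(html)]
```

Both lists should hold the two complete tags, `<a href="page.html">` and `<AREA shape="rect" href="map.html">`.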
or better, use the right tool for the task: sgmllib
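import sgmllib

class Parser(sgmllib.SGMLParser):

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.hrefs = []

    def href(self, attrib):
        # attrib is a list of (name, value) pairs for the tag
        for k, v in attrib:
            if k == "href":
                self.hrefs.append(v)
                break

    # SGMLParser calls do_<tag>(attributes) for each start tag,
    # so the same handler serves both <a> and <area>
    do_a = do_area = href

p = Parser()
p.feed("some html text")
p.close()

print p.hrefs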
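Note for readers on Python 3: sgmllib was removed from the standard library there, and html.parser is the closest standard-library replacement. A rough equivalent of the parser above might look like the following sketch; the class name `LinkParser` and the sample input are assumptions for illustration, not part of the original code.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)
                    break

p = LinkParser()
# hypothetical sample input
p.feed('<a href="page.html">x</a>'
       '<AREA shape="rect" href="map.html">'
       '<img src="i.png">')
p.close()

print(p.hrefs)
```

Here `p.hrefs` should end up holding `page.html` and `map.html`, while the `<img>` tag is ignored.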
</F>
<!-- (the eff-bot guide to) the python standard library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->