Python3 html.parser

Peter Otten __peter__ at web.de
Tue Mar 18 07:44:24 EDT 2014


balaji marisetti wrote:

> Hi,
> 
> I'm trying to parse a piece of HTML code using `html.parser` in Python3.
> I want to find out the offset of a particular end tag (let's say </p>) and
> then stop processing the remaining HTML code immediately. So I wrote
> something like this:
> 
> [code]
> def handle_endtag(self, tag):
>     if tag == mytag:
>         #do something
>         self.reset()
> [/code]
> 
> I called the `reset()` method at the end of the `handle_endtag()` method.
> Now the problem is: when I call parser.feed("some html"), it raises an
> AssertionError. Isn't the `reset()` method supposed to be called inside
> "handler" methods?

Obviously not ;) After looking into the code I think there is no controlled
way to stop parsing from within a handler. I suggest that you raise a custom
exception instead and catch it around the feed() call:

import html.parser

class TagFound(Exception):
    """Raised from a handler to abort parsing early."""

class MyParser(html.parser.HTMLParser):
    def handle_endtag(self, tag):
        # The exception propagates out of feed() and stops the parse.
        if tag == wanted_tag:
            raise TagFound

wanted_tag = "a"
parser = MyParser()
for data in ["<html><body><a></a></body></html>",
             "<html><body><b></b></body></html>"]:
    try:
        parser.feed(data)
    except TagFound:
        print("tag {!r} found".format(wanted_tag))
    else:
        print("tag {!r} not found".format(wanted_tag))
    # Discard any unprocessed input before reusing the parser.
    parser.reset()
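
Since the question also asked for the offset of the end tag, here is a minimal sketch of one way to get it, assuming that the position reported by getpos() inside handle_endtag() is the start of the end tag (that is what I see with CPython). The names EndTagLocator and find_end_tag are made up for this example, and the exception now carries the position:

```python
import html.parser


class TagFound(Exception):
    """Raised to abort parsing; carries the position of the end tag."""

    def __init__(self, position):
        super().__init__(position)
        self.position = position


class EndTagLocator(html.parser.HTMLParser):
    # Hypothetical helper class, not part of html.parser.
    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag

    def handle_endtag(self, tag):
        if tag == self.wanted_tag:
            # getpos() returns (line, column): line is 1-based,
            # column is 0-based.
            raise TagFound(self.getpos())


def find_end_tag(data, wanted_tag):
    """Return (line, column) of the first matching end tag, or None."""
    # A fresh parser per call, so no reset() is needed after an abort.
    parser = EndTagLocator(wanted_tag)
    try:
        parser.feed(data)
        parser.close()
    except TagFound as found:
        return found.position
    return None


print(find_end_tag("<html><body><a></a></body></html>", "a"))
print(find_end_tag("<html><body><b></b></body></html>", "a"))
```

The first call should report a position on line 1, the second should print None. If you need a single character offset into the input rather than a line/column pair, you would have to add up the lengths of the preceding lines yourself.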




