Turning HTMLParser into an iterator

Stefan Behnel stefan_ml at behnel.de
Mon Jun 1 02:38:52 EDT 2009


samwyse wrote:
> I'm processing some potentially large datasets stored as HTML.  I've
> subclassed HTMLParser so that handle_endtag() accumulates data into a
> list, which I can then fetch when everything's done.  I'd prefer,
> however, to have handle_endtag() somehow yield values while the input
> data is still streaming in.  I'm sure someone's done something like
> this before, but I can't figure it out.  Can anyone help?  Thanks.

If you can afford to step away from HTMLParser, you could give lxml a try.
Its iterparse() function supports HTML parsing.
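
A minimal sketch of what that could look like, assuming the data sits in
table rows. The helper name iter_rows and the choice of <tr>/<td> are
illustrative assumptions, not something taken from your code. Each row is
yielded as soon as its end tag has been parsed, so the whole document never
has to be held in memory at once:

```python
import io

from lxml import etree


def iter_rows(source):
    """Yield the cell texts of each <tr> once its end tag has been parsed.

    `source` is a filename or a binary file-like object.
    """
    # html=True selects the HTML parser; tag="tr" restricts events to <tr>.
    for _event, row in etree.iterparse(source, events=("end",), tag="tr", html=True):
        yield ["".join(cell.itertext()) for cell in row.iter("td")]
        # Drop the row's children once consumed so memory use stays bounded.
        row.clear()


if __name__ == "__main__":
    # Hypothetical sample input; real code would pass an open file instead.
    sample = io.BytesIO(
        b"<html><body><table>"
        b"<tr><td>a</td><td>b</td></tr>"
        b"<tr><td>c</td><td>d</td></tr>"
        b"</table></body></html>"
    )
    for cells in iter_rows(sample):
        print(cells)
```

Clearing each row after use is optional, but without it the tree keeps
growing while the input streams in.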

http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
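
If you have to stay with HTMLParser, one possible approach is to feed the
parser in chunks and drain a queue of finished values after each feed() call.
This is only a sketch under assumptions: it uses the Python 3 module name
html.parser, collects the text of <td> elements, and the names CellCollector
and iter_cells are made up for illustration.

```python
import io
from collections import deque
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Queue the text of every completed <td> element (illustrative choice)."""

    def __init__(self):
        super().__init__()
        self.ready = deque()  # finished values waiting to be yielded
        self._text = None     # text pieces while inside a <td>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self.ready.append("".join(self._text))
            self._text = None


def iter_cells(stream, chunk_size=8192):
    """Yield cell texts while `stream` (a text-mode file object) is being read."""
    parser = CellCollector()
    # read() returns "" at end of input for text streams.
    for chunk in iter(lambda: stream.read(chunk_size), ""):
        parser.feed(chunk)
        while parser.ready:
            yield parser.ready.popleft()
    parser.close()  # flush anything the parser was still buffering
    while parser.ready:
        yield parser.ready.popleft()


if __name__ == "__main__":
    sample = io.StringIO("<table><tr><td>a</td><td>b</td></tr></table>")
    for cell in iter_cells(sample, chunk_size=7):
        print(cell)
```

The handlers themselves cannot yield, since feed() drives them, so the
generator wraps the feeding loop instead.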

Stefan


