[Tutor] finding mismatched or unpaired html tags

Stefan Behnel stefan_ml at behnel.de
Tue Apr 28 19:39:17 CEST 2009


A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired
> tags (by line number) in each file. I've read the ElementTree docs and
> cannot
> see anything obvious how to do this. I know this is a common problem but
> feeling a bit clueless here - any ideas?
> 
> Don't use elementTree, use BeautifulSoup instead.

Actually, now that the code is there anyway, the OP might be happier with
lxml.html. It's a lot faster than BeautifulSoup, uses less memory, and
often parses broken HTML better. It's also more user friendly for many HTML
tasks.

http://codespeak.net/lxml/lxmlhtml.html

This might also be worth a read:

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

Stefan



More information about the Tutor mailing list