[Tutor] finding mismatched or unpaired html tags

Dinesh B Vadhia dineshbvadhia at hotmail.com
Wed Apr 29 00:00:12 CEST 2009


Stefan / Alan et al

Thank you for all the advice and links.  A simple script using etree is scanning 500K+ xhtml files, and 2 files with mismatched tags have been found so far, which can be fixed manually.  I'll definitely look into "tidy" as it sounds pretty cool.  Because we are running data processing programs on a 64-bit Windows box (yes, I know, I know ...) using 64-bit Python, we can only use pure-Python libraries.  I believe that lxml uses C libraries.  Again, thanks to everyone - a terrific community as usual!
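
For anyone following along, here is a minimal sketch of the kind of etree scan described above. It is not the actual script; the function name find_malformed and the "*.html" pattern are assumptions made for illustration. It relies on xml.etree.ElementTree raising ParseError (current Python 3 behaviour) and reads the line and column from the exception's position attribute.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed(root_dir, pattern="*.html"):
    """Return (path, line, column, message) for each file that fails to parse.

    find_malformed and the default pattern are hypothetical names chosen
    for this sketch; adjust them to the real directory layout.
    """
    problems = []
    for path in sorted(Path(root_dir).rglob(pattern)):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            # err.position is a (line, column) tuple reported by the parser.
            line, column = err.position
            problems.append((str(path), line, column, str(err)))
    return problems
```

Calling find_malformed("J:/F2") would return one tuple per unparseable file. Note that the parser stops at the first error in a file, so each file yields at most one report, and strict XML parsing may also flag things other than mismatched tags (undefined entities, for example).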



--------------------------------------------------------------------------------

Message: 5
Date: Tue, 28 Apr 2009 19:39:17 +0200
From: Stefan Behnel <stefan_ml at behnel.de>
Subject: Re: [Tutor] finding mismatched or unpaired html tags
To: tutor at python.org
Message-ID: <gt7f05$1ov$1 at ger.gmane.org>
Content-Type: text/plain; charset=ISO-8859-1

A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired tags (by line number) in each file. I've read the ElementTree
>> docs and cannot see anything obvious how to do this. I know this is a
>> common problem but feeling a bit clueless here - any ideas?
> 
> Don't use elementTree, use BeautifulSoup instead.

Actually, now that the code is there anyway, the OP might be happier with
lxml.html. It's a lot faster than BeautifulSoup, uses less memory, and
often parses broken HTML better. It's also more user friendly for many HTML
tasks.

http://codespeak.net/lxml/lxmlhtml.html

This might also be worth a read:

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

Stefan
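
Where only pure-Python libraries are an option, the original goal of listing each mismatched or unpaired tag by line number can also be approached with the standard library's html.parser. The following is only a sketch under stated assumptions: TagChecker, check_html and VOID_TAGS are names invented here, and the check is strict in the XHTML sense (every non-void element is expected to have an explicit closing tag), so it would over-report on HTML that legitimately omits closing tags.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list for this sketch).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TagChecker(HTMLParser):
    """Hypothetical helper: tracks open tags on a stack and records problems."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, line) for tags still open
        self.problems = []   # list of (line, description)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        line = self.getpos()[0]
        if any(open_tag == tag for open_tag, _ in self.open_tags):
            # Anything opened after the matching tag was never closed.
            while self.open_tags[-1][0] != tag:
                open_tag, open_line = self.open_tags.pop()
                self.problems.append((open_line, f"<{open_tag}> never closed"))
            self.open_tags.pop()
        else:
            self.problems.append((line, f"</{tag}> has no matching open tag"))


def check_html(text):
    """Return a sorted list of (line, description) for tag problems in text."""
    checker = TagChecker()
    checker.feed(text)
    checker.close()
    # Whatever is still on the stack at end of input was never closed.
    for open_tag, open_line in checker.open_tags:
        checker.problems.append((open_line, f"<{open_tag}> never closed"))
    return sorted(checker.problems)
```

Unlike the etree approach, this keeps going after the first problem, so a single pass can report several issues per file.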