[Tutor] finding mismatched or unpaired html tags

Tue Apr 28 16:14:22 CEST 2009

Dinesh B Vadhia wrote:
> A.T. / Marty
>  
> I'd prefer that the html parser didn't replace the missing tags, as I
> want to know where and what the problems are.  Also, the source html
> documents were generated by another computer, i.e. they are not web page
> documents.  My sense is that only a few files out of tens of
> thousands are affected.  Cheers ...
>  
> Dinesh

If this is a one-time task, write a script that iterates over the html
files and collects the error message from each one that raises a
'mismatched tag' error. Based on your example below, the message
includes the line and column number. The parser stops at the first
problem it hits, so you'd only get one error per file per run, but you
can fix it and re-run until there are no errors remaining. I hope that
makes sense.
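
A rough sketch of such a script follows. It assumes a Python version
where ElementTree raises ParseError (2.7 or 3.x); on older versions I
believe the error comes from expat as ExpatError, so adjust the except
clause accordingly. The names find_parse_errors and report are made up
for illustration, and it records any parse error, not only
'mismatched tag'.

```python
import os
import xml.etree.ElementTree as ET


def find_parse_errors(root_dir, extensions=(".html", ".htm")):
    """Walk root_dir and return (path, line, column, message) for every
    file that ElementTree fails to parse. Only the first error in each
    file is reported, because the parser stops there."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                ET.parse(path)
            except ET.ParseError as err:
                # err.position is a (line, column) tuple
                line, column = err.position
                problems.append((path, line, column, str(err)))
    return problems


def report(root_dir):
    """Print one line per problem file (hypothetical helper)."""
    for path, line, column, message in find_parse_errors(root_dir):
        print("%s: line %d, column %d: %s" % (path, line, column, message))
```

Calling report() on the top-level directory (J:/F2 in your example, if
I read the path correctly) should list each problem file once. Fix
those and run it again to catch any second error in the same file.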

HTH,
Marty

>  
>  
> ------------------------------------------------------------------------
> Message: 7
> Date: Tue, 28 Apr 2009 08:54:33 -0500
> From: Martin Walsh <mwalsh at mwalsh.org>
> Subject: Re: [Tutor] finding mismatched or unpaired html tags
> To: "tutor at python.org" <tutor at python.org>
> Message-ID: <49F70A99.3050002 at mwalsh.org>
> Content-Type: text/plain; charset=us-ascii
> 
> A.T.Hofkamp wrote:
>> Dinesh B Vadhia wrote:
>>> I'm processing tens of thousands of html files and a few of them
>>> contain mismatched tags and ElementTree throws the error:
>>>
>>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>>> line 124, column 8"
>>>
> >>> I now want to scan each file and simply identify each mismatched or
> >>> unpaired tag (by line number) in each file. I've read the ElementTree
> >>> docs and cannot see anything obvious about how to do this. I know this
> >>> is a common problem but I'm feeling a bit clueless here - any ideas?
>>>
>>
> >> Don't use ElementTree; use BeautifulSoup instead.
> >>
> >> ElementTree expects perfect input, typically generated by another
> >> computer.
>> BeautifulSoup is designed to handle your everyday HTML page, filled with
>> errors of all possible kinds.
> 
> But it also modifies the source html by default, adding closing tags,
> etc. Important to know, I suppose, if you intend to re-write the html
> files you parse with BeautifulSoup.
> 
> Also, unless you're running Python 3.0 or greater, use the 3.0.x series
> of BeautifulSoup -- otherwise you may run into the same kind of parse
> errors on malformed markup.
> 
> http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
> 
> HTH,
> Marty
> 
> 