HTML File Parsing

Stefan Behnel stefan_ml at behnel.de
Tue Oct 28 16:18:33 EDT 2008


Felipe De Bene wrote:
> I'm having problems parsing an HTML file with the following syntax :
> 
> <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
>     <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
>     <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
> BGCOLOR='#c0c0c0'>Date</TH>
> and so on....
> 
> whenever I feed the parser with such file I get the error :
> 
> HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at
> line 515, column 45

Your HTML page is not valid HTML, i.e. it is broken: an end tag such as
</TH BGCOLOR='#c0c0c0'> must not carry attributes. Python's HTMLParser is not
made for parsing broken HTML. However, you can use the parser of lxml.html to
fix up your HTML for you.

http://codespeak.net/lxml/
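
A minimal sketch of what that could look like, assuming lxml is installed.
The markup below is an assumption: it is reconstructed from the quoted snippet
plus the end tag named in the error message, with a closing </TABLE> added,
and is not the original file. The variable names are made up for illustration.

```python
import lxml.html

# Assumed stand-in for the broken page: the last end tag carries an
# attribute, which is what the error message in the question complains about.
broken = """
<TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
    <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
    <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
BGCOLOR='#c0c0c0'>Date</TH BGCOLOR='#c0c0c0'>
</TABLE>
"""

# lxml.html parses in recovering mode, so malformed tags do not raise.
root = lxml.html.fromstring(broken)

# Tag names are normalised to lower case by the HTML parser.
headers = [th.text_content().strip() for th in root.iter('th')]
print(headers)

# Serialising the tree gives back cleaned-up markup that can be saved
# or handed to a stricter parser.
fixed = lxml.html.tostring(root, encoding='unicode')
print(fixed)
```

For a file on disk, lxml.html.parse(filename).getroot() should give the same
kind of tree.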

Stefan


