HTML File Parsing

Mon Dec 1 12:30:42 EST 2008

On Oct 28, 3:18 pm, Stefan Behnel <stefan... at behnel.de> wrote:
> Felipe De Bene wrote:
> > I'm having problems parsing an HTML file with the following syntax :
>
> > <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
> >     <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
> >     <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
> > BGCOLOR='#c0c0c0'>Date</TH>
> > and so on....
>
> > whenever I feed the parser with such file I get the error :
>
> > HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at
> > line 515, column 45
>
> Your HTML page is not HTML, i.e. it is broken. Python's HTMLParser is not made
> for parsing broken HTML. However, you can use the parser of lxml.html to fix up
> your HTML for you.
>
> http://codespeak.net/lxml/
>
> Stefan

It doesn't just choke on bad HTML; it also chokes on JavaScript that
writes HTML. For example, document.write('<scr'+'ipt language="javascript1.1"
src="http:/... also results in an error.

However, when I did:

parser = aqparser()  # an implementation of HTMLParser
parser.CDATA_CONTENT_ELEMENTS = ()  # no element body (script, style) is treated as raw CDATA

it worked. Strange...
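
For anyone who wants to see what that override changes, here is a minimal sketch. It assumes Python 3's html.parser module rather than the Python 2 HTMLParser module this thread is about, and TagCollector, start_tags and SAMPLE are hypothetical names made up for the illustration (TagCollector stands in for aqparser). CDATA_CONTENT_ELEMENTS names the elements whose bodies the parser passes through as raw text, normally script and style. Emptying it makes the parser read a script body as ordinary markup, which is presumably why the error went away, though I have not traced it through the 2.x source.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical stand-in for aqparser: records every start tag it sees."""

    def __init__(self, parse_script_bodies=False):
        super().__init__()
        self.tags = []
        if parse_script_bodies:
            # Same override as above: with no CDATA elements declared,
            # <script> and <style> bodies are parsed like any other markup.
            self.CDATA_CONTENT_ELEMENTS = ()

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(html, parse_script_bodies=False):
    """Return the start tags the parser reports for the given HTML string."""
    parser = TagCollector(parse_script_bodies)
    parser.feed(html)
    parser.close()
    return parser.tags


# Made-up sample input: a script that writes markup into the page.
SAMPLE = "<p>hi</p><script>document.write('<b>x</b>')</script>"

default_tags = start_tags(SAMPLE)                             # script body kept as raw text
override_tags = start_tags(SAMPLE, parse_script_bodies=True)  # script body parsed as markup
print(default_tags, override_tags)
```

The catch is that with the override, any markup-looking text inside a script gets reported as real tags, so a handler that collects links or table cells may pick up things that only exist inside JavaScript strings.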

-Peter