HTML File Parsing
worldgnat
worldgnat at gmail.com
Mon Dec 1 12:30:42 EST 2008
On Oct 28, 3:18 pm, Stefan Behnel <stefan... at behnel.de> wrote:
> Felipe De Bene wrote:
> > I'm having problems parsing an HTML file with the following syntax :
>
> > <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
> > <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
> > <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
> > BGCOLOR='#c0c0c0'>Date</TH>
> > and so on....
>
> > whenever I feed the parser with such file I get the error :
>
> > HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at
> > line 515, column 45
>
> Your HTML page is not HTML, i.e. it is broken. Python's HTMLParser is not made
> for parsing broken HTML. However, you can use the parser of lxml.html to fix up
> your HTML for you.
>
> http://codespeak.net/lxml/
>
> Stefan
It doesn't just choke on bad HTML; it also chokes on JavaScript that
writes HTML. For example, document.write('<scr'+'ipt language="javascript1.1"
src="http:/... also results in an error.
However, when I did:
    parser = aqparser()  # aqparser is my subclass of HTMLParser
    parser.CDATA_CONTENT_ELEMENTS = ()
it worked. Strange...
-Peter
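
A note on what the override above changes, as a minimal sketch rather than a
reconstruction of the original code. CDATA_CONTENT_ELEMENTS lists the elements
("script" and "style" by default) whose content the parser treats as raw text.
Setting it to an empty tuple makes the parser tokenize script bodies as
ordinary markup, which is a plausible reason the error went away. The sketch
assumes Python 3's html.parser (the thread used the Python 2 HTMLParser
module), and TagCollector, start_tags, and sample are hypothetical names
standing in for aqparser and the real page. The Python 3 parser is lenient and
does not raise HTMLParseError, so this only illustrates the difference in
tokenizing, not the original failure.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical stand-in for aqparser: records every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(markup, disable_cdata=False):
    """Parse markup and return the start tags reported by the parser."""
    parser = TagCollector()
    if disable_cdata:
        # Same override as in the post, applied to this instance only.
        parser.CDATA_CONTENT_ELEMENTS = ()
    parser.feed(markup)
    parser.close()
    return parser.tags


# A script body that writes HTML, loosely modelled on the case in the post.
sample = "<script>document.write('<b>hi</b>');</script>"

default_tags = start_tags(sample)
overridden_tags = start_tags(sample, disable_cdata=True)

print("default:   ", default_tags)
print("overridden:", overridden_tags)
```

With the default setting the parser should report only the script tag, because
the script body is passed through as text. With the override it should also
report the b tag found inside the string literal. That side effect is worth
keeping in mind: with the override in place, markup that appears inside
JavaScript strings reaches handle_starttag like any other tag.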