HTMLParser.HTMLParseError: EOF in middle of construct

Tue Jun 19 02:22:44 EDT 2007

Sergio Monteiro Basto wrote:
> Can someone explain to me what is wrong with this site?
> 
> python linkExtractor3.py http://www.noticiasdeaveiro.pt > test
> 
> HTMLParser.HTMLParseError: EOF in middle of construct, at line 1173,
> column 1
> 
> Line 1173 of the test file looks perfectly normal.
> 
> I would like to know what I have to clean up before parsing the HTML page.
> I am sending the Python code as an attachment.

You don't want to do these things with HTMLParser. lxml is much easier to use
and supports broken HTML (as in the page you're parsing).

Note that there is an SVN branch of lxml that comes with an HTML package
(lxml.html) that provides a "clean()" function. Just parse the page with the
HTML parser provided by the package (a few lines), then call the clean()
function on it with parameters that say what to get rid of (scripts and the
like).
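
A minimal sketch of the parse-then-strip approach, under a few assumptions: the
sample markup (BROKEN_HTML) and the helper name extract_links() are made up for
illustration, and etree.strip_elements() stands in for the clean() call, since
I am not assuming the exact signature or location of clean() in whatever lxml
version you end up with.

```python
import lxml.html
from lxml import etree

# Hypothetical sample input: unclosed <p> and <a> tags, a stray </div>,
# and no closing </body> or </html>.
BROKEN_HTML = """<html><body>
<p>News <a href="/local.html">local
<script>var tracker = 1;</script>
<a href='http://example.org/world'>world</a>
</div>
"""


def extract_links(html_text):
    """Return the href values of all <a> elements, in document order."""
    # The lxml HTML parser recovers from broken markup instead of raising.
    doc = lxml.html.fromstring(html_text)
    # Stand-in for clean(): drop <script> and <style> elements, but keep
    # any text that follows them (with_tail=False).
    etree.strip_elements(doc, "script", "style", with_tail=False)
    return [str(href) for href in doc.xpath("//a/@href")]


if __name__ == "__main__":
    for link in extract_links(BROKEN_HTML):
        print(link)
```

For a real page, read the bytes you fetched and pass them to extract_links()
in place of BROKEN_HTML.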

The docs:
http://codespeak.net/lxml/dev/

The SVN branch:
http://codespeak.net/svn/lxml/branch/html/

You seem to be on Linux, so compiling lxml should be simple enough:
http://codespeak.net/lxml/dev/build.html#subversion

Have fun,
Stefan