HTMLParser.HTMLParseError: EOF in middle of construct

Rob Wolfe rw at smsnet.pl
Wed Jun 20 03:07:39 EDT 2007


Sérgio Monteiro Basto wrote:
> Stefan Behnel wrote:
>
> > Sérgio Monteiro Basto wrote:
> >> but there is one single error that blocks this.
> >> Finally I found it; it is:
> >> <td colspan="2"align="center"
> >> if I put:
> >> <td colspan="2" align="center"
> >>
> >> p = re.compile('"align')
> >> content = p.sub('" align', content)
> >>
> >> I can parse the HTML.
> >> I don't know if it is a bug in HTMLParser.
> >
> > Sure, and next time your key doesn't open your neighbour's house, please
> > report to the building company to have them fix the door.
> >
>
> The question here is whether
> <td colspan="2"align="center"
> is valid HTML or not.
> I think it is valid; if so, it's a bug in HTMLParser.

According to the HTML 4.01 specification this is *not valid* HTML.

"""
Elements may have associated properties, called attributes, which may
have values (by default, or set by authors or scripts). Attribute/value
pairs appear before the final ">" of an element's start tag. Any number
of (legal) attribute value pairs, separated by spaces, may appear in an
element's start tag.
"""

> If not, we still have a very bad error message (EOF in middle of
> construct !?)

HTMLParser can cope with some errors, e.g. missing end tags,
but it can't handle many other problems.
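
The missing space between attributes is one of the cases it chokes on.
If you stay with a clean-up pass before parsing, the single-attribute
substitution quoted above could be generalized along these lines. This
is only a sketch: the name fix_attr_spacing and the pattern are my own
assumptions, not part of HTMLParser, and a regex like this may also
touch text outside tags that happens to look like ="..."x.

```python
import re

# A double-quoted attribute value immediately followed by a letter,
# i.e. the start of the next attribute name with no space in between.
# (Assumed pattern for illustration; it ignores single-quoted values.)
_MISSING_SPACE = re.compile(r'(=\s*"[^"<>]*")(?=[A-Za-z])')


def fix_attr_spacing(html):
    """Insert the space that HTML 4.01 requires between attributes."""
    return _MISSING_SPACE.sub(r'\1 ', html)


cleaned = fix_attr_spacing('<td colspan="2"align="center">')
```

Markup that already has the space is left untouched, so the pass can be
applied to every page before it is fed to the parser.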

> I have to use HTMLParser because I want to use only the Python 2.4 standard
> library; I have to install the scripts on many machines.
> And I have to parse many different sites. I just want to extract the links, so
> a clean-up before parsing solves my problem very quickly.

With Python 2.4 you have to use some third-party module. There is no
other option for _invalid_ HTML. IMHO BeautifulSoup is the best among
them.
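
For pages that HTMLParser does accept (or that a clean-up pass has made
acceptable), pulling out the links needs very little code. A minimal
sketch follows; LinkExtractor and extract_links are hypothetical names,
and the import fallback is only there so the same file also runs on
newer Pythons where the module was renamed. It does nothing to make the
parser more tolerant of invalid markup.

```python
try:
    from HTMLParser import HTMLParser      # module name in Python 2.x
except ImportError:
    from html.parser import HTMLParser     # module name in Python 3


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        # Explicit base-class call; HTMLParser is an old-style class in 2.x.
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                self.links.append(value)


def extract_links(html):
    """Return the href values found in html, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Anchors without an href (e.g. <a name="top">) are skipped, so the result
contains only actual link targets.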

--
HTH,
Rob



