HTMLParser.HTMLParseError: EOF in middle of construct

sergio sergio at sergiomb.no-ip.org
Wed Jun 20 09:13:23 EDT 2007


Rob Wolfe wrote:

> 
> Sérgio Monteiro Basto wrote:
>> Stefan Behnel wrote:
>>
>> > Sérgio Monteiro Basto wrote:
>> >> but there is one single error that blocks this.
>> >> I finally found it; it is:
>> >> <td colspan="2"align="center"
>> >> If I put this instead:
>> >> <td colspan="2" align="center"
>> >>
>> >> p = re.compile('"align')
>> >> content = p.sub('" align', content)
>> >>
>> >> I can parse the HTML.
>> >> I don't know if it is a bug in HTMLParser.
>> >
>> > Sure, and next time your key doesn't open your neighbour's house, please
>> > report to the building company to have them fix the door.
>> >
>>
>> The question here is whether
>> <td colspan="2"align="center"
>> is valid HTML or not.
>> I think it is valid; if so, it's a bug in HTMLParser.
> 
> According to the HTML 4.01 specification this is *not valid* HTML.
> 
> """
> Elements may have associated properties, called attributes, which may
> have values (by default, or set by authors or scripts). Attribute/value
> pairs appear before the final ">" of an element's start tag. Any number
> of (legal) attribute value pairs, separated by spaces, may appear in an
> element's start tag.
> """
> 
>> If not, we still get a very bad error message (EOF in middle of
>> construct!?)
> 
> HTMLParser can deal with some errors e.g. lack of ending tags,
> but it can't handle many other problems.
> 
>> I have to use HTMLParser because I want to use only the Python 2.4
>> standard library; I have to install the scripts on many machines.
>> I also have to parse many different sites, and I just want to extract
>> the links, so a clean-up before parsing solves my problem very quickly.
> 
> In Python 2.4 you have to use some third-party module. There is no
> other option for _invalid_ HTML. IMHO BeautifulSoup is the best among
> them.
> 

Many thanks, Rob; you have been as clear as water.
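
For the record, the clean-up-before-parsing approach discussed above could
be sketched roughly as follows. This is only a sketch under assumptions: the
names add_missing_attr_spaces, LinkExtractor and extract_links are made up
for illustration, and the regular expression is a generalisation of the
'"align' substitution quoted earlier, not something HTMLParser provides. The
import fallback is there only so that the same file also runs on Python 3,
where the module is called html.parser.

```python
import re

# HTMLParser lives in different modules in Python 2 and Python 3.
try:
    from HTMLParser import HTMLParser      # Python 2.x (including 2.4)
except ImportError:
    from html.parser import HTMLParser     # Python 3.x

# Hypothetical pattern: a double-quoted attribute value directly followed
# by a letter, i.e. the next attribute starts without a separating space.
_MISSING_SPACE = re.compile(r'(=\s*"[^"<>]*")(?=[A-Za-z])')


def add_missing_attr_spaces(content):
    """Insert a space after a quoted value that runs into the next attribute."""
    return _MISSING_SPACE.sub(r'\1 ', content)


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        # Explicit base-class call, which works on both Python 2 and 3.
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(content):
    """Clean up the markup first, then parse it and return the links found."""
    parser = LinkExtractor()
    parser.feed(add_missing_attr_spaces(content))
    parser.close()
    return parser.links
```

Calling extract_links(content) on the text of a downloaded page returns the
href values in document order. The pattern only handles double-quoted values
and does not check that the match is inside a tag, so it may also touch
similar-looking text elsewhere in the page; for link extraction that should
be harmless, but it is not a general repair for invalid HTML.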

> --
> HTH,
> Rob

-- 
Best regards,
--
Sérgio M. B. 


