HTMLParser bug?

Anand Pillai pythonguy at Hotpop.com
Mon May 12 10:26:05 EDT 2003


Continuing on this subject, a bug was reported against
HTMLParser for tags whose attribute value contains a quote
character (single or double). For example,

(Quoted from url http://www.python.org/doc/2.3b1/whatsnew/index.html)

<link rel="first" href="whatsnew23.html" title='What's New in Python 2.3'>

The attribute title contains a single quote inside its value.
HTMLParser bails out at this point giving the error,

"EOF in middle of construct, line 7,column 54"

The problem is with the regular expression that matches the
attribute value, namely
<re>
\'[^\']*\'|"[^"]*"
</re>

This re won't allow the quote character that delimits the value
(' or ") to appear inside the attribute value. It occurs in two
re objects in the module: one that finds the end of a start tag
(locatestarttagend) and another that finds the attribute value
in a tag (attrfind).
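To make the failure concrete, here is a small sketch that applies only the quoted value pattern to the example tag. It assumes the pattern is compiled on its own, outside the larger locatestarttagend and attrfind expressions, and the names value_re, tag and start are just for illustration.

```python
import re

# The attribute-value pattern quoted above, compiled standalone
# (assumption: in the module it is part of a larger expression).
value_re = re.compile(r'\'[^\']*\'|"[^"]*"')

tag = """<link rel="first" href="whatsnew23.html" title='What's New in Python 2.3'>"""

# Begin matching at the opening quote of the title value.
start = tag.index("title=") + len("title=")
m = value_re.match(tag, start)

print(m.group(0))      # the part the pattern accepts as the value
print(tag[m.end():])   # the leftover text the parser still has to deal with
```

The match stops at the apostrophe in "What's", so the value comes out as 'What' and the rest of the title is left dangling inside the tag.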

The fix I tried is to replace this re with
<re>
\'.[^>]*\'|".[^>]*"
</re>

This will match an attribute value containing any sequence
of characters up to a '> or a ">, which should indicate the
logical end of the tag. That is what we should look for,
rather than the ' or " alone. It allows attributes like the
one in the example above, and parsing still continues.
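A quick way to check the replacement pattern is to run it by hand against the same tag. This is only a sketch: the pattern is compiled standalone rather than patched into the module, and value_at is a hypothetical helper, not anything in HTMLParser.

```python
import re

# The replacement pattern from this post, compiled standalone.
proposed_re = re.compile(r'\'.[^>]*\'|".[^>]*"')

tag = """<link rel="first" href="whatsnew23.html" title='What's New in Python 2.3'>"""

def value_at(name):
    # Hypothetical helper: match at the opening quote of attribute `name`.
    start = tag.index(name + "=") + len(name) + 1
    return proposed_re.match(tag, start)

title_match = value_at("title")
rel_match = value_at("rel")

print(title_match.group(0))  # value matched for title
print(rel_match.group(0))    # value matched for rel
```

The title value now comes out whole, apostrophe included. Note that [^>]* is greedy, so when the pattern is used alone like this the match for rel runs on to the last double quote before the >, giving "first" href="whatsnew23.html". Tags with several attributes in the same quote style are worth testing before relying on the change.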

I have found that this works for a number of examples I
tried without causing other bugs. 

I still haven't tried mxTidy... :-)

Anand Pillai

pythonguy at Hotpop.com (Anand Pillai) wrote in message news:<84fc4588.0305100227.36f8546 at posting.google.com>...
> Well, no arguments on this. I am not an html coder
> professionally. Only when I ran my page through the parser
> did I find out that the html in it is wrong! I am a C++
> programmer and python hobbyist.
> 
>   Btw thanks for pointing out the problem with my webpage.
> Something surely seems to have gone wrong and the index.html
> file is missing :-(. Let me fix that first.
> 
> Anand Pillai
> 
> 
> Nick Vargish <nav at adams.patriot.net> wrote in message news:<yyyd6irew6y.fsf at adams.patriot.net>...
> > pythonguy at Hotpop.com (Anand Pillai) writes:
> > 
> > > contains some invalid HTML. I have seen many pages with
> > > this kind of code (my own homepage for example ;-)).
> > 
> >   [ ... ]  
> > 
> > > Anand Pillai
> > > http://members.fortunecity.com/anandpillai
> > 
> > I get a 404 for that page. BTW, if you know your HTML doesn't match
> > the published standard, you could try fixing it. :^)
> > 
> > Nick



