[ python-Bugs-1117302 ] sgmllib.SGMLParser

SourceForge.net noreply at sourceforge.net
Tue Feb 8 09:03:36 CET 2005


Bugs item #1117302, was opened at 2005-02-06 15:04
Message generated for change (Comment added) made by effbot
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1117302&group_id=5470

Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Paul Birnie (pbirnie)
Assigned to: Nobody/Anonymous (nobody)
Summary: sgmllib.SGMLParser

Initial Comment:
sgmllib.SGMLParser calls the start-tag and end-tag methods 
correctly until it encounters

        <a title="link1" href="url1">One</a>
        <br/><a title="link2" href="someurl2">Two</a>
        <a title="link2" href="url3">Three</a> 

The <br/> seems to confuse the parser, and I only get 
callbacks for tag a twice (links 1 and 3).
  



----------------------------------------------------------------------

>Comment By: Fredrik Lundh (effbot)
Date: 2005-02-08 09:03

Message:
Logged In: YES 
user_id=38376

footnote 2: if you need to deal with broken HTML, use 
TidyLib:

http://utidylib.berlios.de/
http://effbot.org/zone/element-tidylib.htm

----------------------------------------------------------------------

Comment By: Fredrik Lundh (effbot)
Date: 2005-02-08 09:01

Message:
Logged In: YES 
user_id=38376

footnote: <br/> is an XML construct, and is not valid HTML.  
In HTML, "<tag/blah/" is short for "<tag>blah</tag>", so the 
BR section is parsed as

START br
DATA ><a title="link2" href="someurl2">Two<
END br
DATA a>

which is 100% correct.  For more on this topic, see:

http://www.cs.tut.fi/~jkorpela/html/empty.html
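
To make the trace above concrete, here is a minimal sketch in Python of how
the "<tag/blah/" shorthand swallows the markup after <br/>. It is a toy
model, not sgmllib's actual code: the names NET_TAG and net_events are
hypothetical, and it recognises only the shorthand form, treating everything
else as plain data.

```python
import re

# Toy model of the SGML shorthand "<tag/data/" (short for "<tag>data</tag>").
# NET_TAG and net_events are illustrative names, not part of sgmllib.
NET_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/([^/]*)/")


def net_events(text):
    """Return (kind, value) events for the shorthand tags found in text."""
    events = []
    pos = 0
    for match in NET_TAG.finditer(text):
        # Anything before the shorthand tag is reported as plain data.
        if match.start() > pos:
            events.append(("DATA", text[pos:match.start()]))
        events.append(("START", match.group(1)))
        # Everything up to the next "/" is the element's content.
        if match.group(2):
            events.append(("DATA", match.group(2)))
        events.append(("END", match.group(1)))
        pos = match.end()
    # Whatever is left after the closing "/" is plain data again.
    if pos < len(text):
        events.append(("DATA", text[pos:]))
    return events


if __name__ == "__main__":
    line = '<br/><a title="link2" href="someurl2">Two</a>'
    for kind, value in net_events(line):
        print(kind, value)
```

Run on the second line of the reported input, the sketch yields the same four
events as the trace: the first "/" opens the BR element and the "/" inside
"</a>" closes it, so the link in between is never seen as a tag.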

----------------------------------------------------------------------
