[Python-bugs-list] [ python-Bugs-423779 ] sgmllib.py not good at handling <br/>

noreply@sourceforge.net
Wed, 16 May 2001 08:54:48 -0700


Bugs item #423779, was updated on 2001-05-13 13:28
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=423779&group_id=5470

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Chris Withers (fresh)
>Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: sgmllib.py not good at handling <br/>

Initial Comment:
When parsing the following HTML:

'Roses <b>are</B> red,<br/>violets <i>are</i> blue'

...with the following class:

import sgmllib
from pprint import pprint

class HTML2SafeHTML(sgmllib.SGMLParser):
    # Note: self.openTags is not defined in this excerpt.

    def handle_data(self, data):
        print "***data***"
        print data

    def unknown_starttag(self, tag, attrs):
        print "***start**"
        print tag
        pprint(attrs)
        pprint(self.openTags)

    def unknown_endtag(self, tag):
        print "***end**"
        print tag
        pprint(self.openTags)

I get the following output, which isn't right :-S

***data***
Roses
***start**
b
[]
[]
***data***
are
***end**
b
['b']
***data***
 red,
***start**
br
[]
[]
***data***
>violets <i>are<
***end**
br
[]
***data***
i> blue

cheers,

Chris

----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2001-05-16 08:54

Message:
Logged In: YES 
user_id=3066

While there is definitely room for improvement in sgmllib, and probably a need for a few bug fixes, it is not clear that this is one of the bugs.

SGML defines something called the "null end tag" (NET) and the "NET enabler".  In a document, this looks like:

    <tag/ content /

This represents an element "tag" with the content " content ".  The first slash is the enabler and the second is the NET.  Basic support for this has been a part of sgmllib for as long as I can remember; it was added before I started playing with it.
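
To make the connection to the reported output concrete, here is a minimal sketch of NET matching. The regular expression and the split_net helper are illustrative assumptions written for this note, not a copy of sgmllib's code; they only follow the "<tag/ content /" shape described above.

```python
import re

# Assumed pattern (hypothetical, for illustration): "<name/" opens the
# element (the slash is the NET enabler), everything up to the next "/"
# is the content, and that next "/" is the null end tag.
NET_ELEMENT = re.compile(r"<([a-zA-Z][-.a-zA-Z0-9]*)/([^/]*)/")


def split_net(text):
    """Split text around the first NET-style element.

    Returns (before, tag, content, after), or None if no such element
    is found.
    """
    match = NET_ELEMENT.search(text)
    if match is None:
        return None
    before = text[:match.start()]
    after = text[match.end():]
    return before, match.group(1), match.group(2), after


# The example from the SGML description.
print(split_net("<tag/ content /"))

# The markup from the report: "<br/" is read as a NET enabler, so the
# element runs up to the slash inside "</i>".
print(split_net("Roses <b>are</B> red,<br/>violets <i>are</i> blue"))
```

Under these assumptions the second call yields the tag "br" with the content ">violets <i>are<" and leaves "i> blue" as trailing text, which is the same split as in the reported output.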

In practice, use of the NET in HTML doesn't seem to exist.  Perhaps it should be something that can be specifically enabled or disabled?  I'm more inclined to say that sgmllib should not be used for XHTML though -- XHTML is *not* SGML, it's XML, and that's something different.

Have you tried to apply xmllib to your application?  Given your desire (stated elsewhere) to work with seriously broken HTML as well, you may be better off using a custom parser similar to TAL.HTMLParser used by PageTemplates.

----------------------------------------------------------------------
