sgmllib.py not good at handling <br/>

Chris Withers chrisw at nipltd.com
Mon May 14 08:22:13 EDT 2001


Hi,

I posted this to the bug tracker:
http://sourceforge.net/tracker/?func=detail&aid=423779&group_id=5470&atid=105470

...but it's holding me up badly so I thought I'd ask here too in the hope that
one of you kind souls can help out :-)

When parsing the following HTML: 

'Roses <b>are</B> red,<br/>violets <i>are</i> blue' 

...with the following class: 

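import sgmllib

class HTML2SafeHTML(sgmllib.SGMLParser):

    # Called for each run of text between tags.
    def handle_data(self, data):
        print "***data***"
        print data

    # Called for any start tag that has no dedicated start_<tag> method.
    def unknown_starttag(self, tag, attrs):
        print "***start**"
        print tag
        print attrs

    # Called for any end tag that has no dedicated end_<tag> method.
    def unknown_endtag(self, tag):
        print "***end**"
        print tag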

I get the following output, which isn't right :-S 

***data*** 
Roses 
***start** 
b 
[] 
***data*** 
are 
***end** 
b 
***data*** 
red, 
***start** 
br 
[] 
***data*** 
>violets <i>are< 
***end** 
br 
***data*** 
i> blue 

Any idea what's broken, where, and how to fix it? I get the same with
htmllib.py in Python 1.5.2, 2.0, and the latest from CVS.
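
For what it's worth, the output looks as though `<br/` is being read as the start of an SGML short tag (`<tag/data/`), which would explain why everything up to the next `/` is reported as data and followed by an end tag for br. Assuming that is the cause, a stopgap might be to rewrite empty-element tags before the text reaches the parser. This is only a sketch: the name `strip_empty_tag_slashes` is made up for illustration, and the pattern does not handle attribute values that contain `<` or `>`.

```python
import re

# Matches a start tag that ends in "/>", such as <br/>, <hr /> or
# <img src="a/b.png"/>. End tags like </i> are not matched, because the
# character after "<" must be a letter.
_EMPTY_TAG = re.compile(r'<([a-zA-Z][^<>]*?)\s*/>')

def strip_empty_tag_slashes(html):
    """Rewrite <tag/> as <tag> so the parser never sees '<tag/'."""
    return _EMPTY_TAG.sub(r'<\1>', html)

cleaned = strip_empty_tag_slashes(
    'Roses <b>are</B> red,<br/>violets <i>are</i> blue')
```

The cleaned string could then be passed to the parser's feed() method in place of the raw HTML.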

cheers,

Chris



