[Python-bugs-list] [ python-Bugs-803422 ] sgmllib doesn't support hex or Unicode character references

SourceForge.net noreply at sourceforge.net
Tue Sep 9 15:00:34 EDT 2003


Bugs item #803422, was opened at 2003-09-09 15:53
Message generated for change (Comment added) made by aaronsw
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=803422&group_id=5470

Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Aaron Swartz (aaronsw)
Assigned to: Nobody/Anonymous (nobody)
Summary: sgmllib doesn't support hex or Unicode character references

Initial Comment:
sgmllib supports neither hexadecimal character references nor
references to Unicode characters, both of which are commonly seen
on web pages. The following replacements fix both problems.

charref = re.compile('&#([0-9a-fA-F]+)[^0-9a-fA-F]')

	def handle_charref(self, ref):
		try:
			if ref[0] == 'x' or ref[0] == 'X':
				m = int(ref[1:], 16)
			else:
				m = int(ref)
			self.handle_data(unichr(m).encode('utf-8'))
		except ValueError:
			self.unknown_charref(ref)

----------------------------------------------------------------------

>Comment By: Aaron Swartz (aaronsw)
Date: 2003-09-09 16:00

Message:
Logged In: YES 
user_id=122141

Oops, that should be:

charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')
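
The first pattern cannot match a reference such as &#x41; because "x" is not in its character class; the corrected pattern accepts a leading x or X. A minimal standalone sketch of the same decoding logic follows, assuming Python 3 (where chr replaces unichr) and leaving sgmllib itself out. The names decode_charref and extract_charrefs are hypothetical helpers for illustration, not part of sgmllib.

```python
import re

# Corrected pattern from the comment above: the first captured character
# may be x or X, the rest must be hex digits.
charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')


def decode_charref(ref):
    """Hypothetical helper: map a reference body ('65', 'x41') to a character.

    Returns None where the proposed handle_charref would call
    unknown_charref, for example for 'ff' (hex digits without an x prefix).
    """
    try:
        if ref[0] == 'x' or ref[0] == 'X':
            codepoint = int(ref[1:], 16)
        else:
            codepoint = int(ref)
        # Python 3 spelling of unichr(); raises ValueError above U+10FFFF.
        return chr(codepoint)
    except (ValueError, OverflowError):
        return None


def extract_charrefs(text):
    """Hypothetical helper: decode every character reference found in text."""
    return [decode_charref(m.group(1)) for m in charref.finditer(text)]
```

Because the decimal and hexadecimal forms share one pattern, a body like "ff" is matched by the regex and only rejected when int() fails, which is why the ValueError branch is still needed.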

----------------------------------------------------------------------
