[ python-Bugs-1459279 ] sgmllib.SGMLparser and hexadecimal numeric character refs

SourceForge.net noreply at sourceforge.net
Mon Mar 27 14:51:59 CEST 2006


Bugs item #1459279, was opened at 2006-03-27 14:51
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1459279&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Francesco Ricciardi (nerby)
Assigned to: Nobody/Anonymous (nobody)
Summary: sgmllib.SGMLparser and hexadecimal numeric character refs

Initial Comment:
According to the HTML 4.0 specification, numeric character
references may be written in hexadecimal as well as in
decimal (see
http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1).

However, sgmllib.SGMLParser does not recognize the
hexadecimal form.

More and more HTML pages now use character references with
high code points that are not in the official HTML entity
list, so proper handling of these references should be
implemented.

A possible solution could be:
- improving the "charref" regular expression so that it
also matches hexadecimal values;
- treating all numeric references as valid: those with
n <= 255 should be converted to the corresponding
characters, and those above 255 should be left as numeric
character references.
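
The two points above can be sketched as follows. This is only a
standalone illustration in Python 3, not a patch to sgmllib: the
names CHARREF and replace_charrefs are hypothetical, and the
pattern is an assumption about what an extended "charref"
expression could look like.

```python
import re

# Hypothetical pattern (not the one in sgmllib): matches decimal
# references such as &#233; and hexadecimal ones such as &#xE9; or &#XE9;.
CHARREF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def replace_charrefs(text):
    """Convert numeric character references up to 255 to characters.

    References above 255 are left in the text unchanged, as proposed
    in the second point of the list above.
    """
    def convert(match):
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            n = int(hex_digits, 16)
        else:
            n = int(dec_digits)
        if n <= 255:
            return chr(n)
        # Code point above 255: keep the original reference as written.
        return match.group(0)

    return CHARREF.sub(convert, text)
```

With this sketch, "&#233;" and "&#xE9;" are both converted to the
same character, while "&#x20AC;" is passed through untouched.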

----------------------------------------------------------------------


