[Patches] [ python-Patches-590682 ] New codecs: html, asciihtml

Sun, 04 Aug 2002 04:10:22 -0700

Patches item #590682, was opened at 2002-08-04 04:58
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=590682&group_id=5470

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Oren Tirosh (orenti)
Assigned to: Nobody/Anonymous (nobody)
Summary: New codecs: html, asciihtml

Initial Comment:
These codecs translate HTML character &entity; 
references.

The html codec may be applied after other codecs such 
as utf-8 or iso8859_X and preserves their encoding.  The 
asciihtml encoder produces 7-bit ascii and its output is 
therefore safe for insertion into almost any document 
regardless of its encoding.
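
A minimal sketch of the 7-bit idea, assuming modern Python 3 and its
html.escape function. The helper name ascii_html_encode is hypothetical
and this is not the code from the patch; it only illustrates why
pure-ASCII output reads the same under any ASCII-compatible encoding.

```python
from html import escape


def ascii_html_encode(text):
    """Hypothetical helper: escape markup characters, then emit 7-bit ASCII."""
    # First turn &, < and > into &amp;, &lt; and &gt;.
    escaped = escape(text, quote=False)
    # Then replace every non-ASCII character with a numeric character reference.
    return escaped.encode("ascii", errors="xmlcharrefreplace")


payload = ascii_html_encode("1 < 2 & caf\u00e9")

# The same bytes decode to the same string under several encodings.
views = {payload.decode(enc) for enc in ("ascii", "utf-8", "iso8859_1")}
```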

----------------------------------------------------------------------

>Comment By: Oren Tirosh (orenti)
Date: 2002-08-04 11:10

Message:
Logged In: YES 
user_id=562624

PEP 293 and patch #432401 are not a replacement for these 
codecs: the codecs decode as well as encode, and they also 
translate <, >, and &, which are valid in all encodings and 
therefore won't get translated by error callbacks.
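
The point about error callbacks can be checked with a short sketch,
assuming the "xmlcharrefreplace" handler as it exists in modern Python 3.
The handler is only invoked for characters the target encoding cannot
represent, so ASCII markup characters pass through untouched.

```python
text = "x < y & caf\u00e9"

# Only the non-ASCII character triggers the error handler;
# '<' and '&' are valid ASCII and are emitted unchanged.
encoded = text.encode("ascii", errors="xmlcharrefreplace")
```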

----------------------------------------------------------------------

Comment By: Oren Tirosh (orenti)
Date: 2002-08-04 11:00

Message:
Logged In: YES 
user_id=562624

Yes, the error callback approach handles strange mixes 
better than my method of chaining codecs. But it only does 
encoding - this patch also provides full decoding of named, 
decimal and hexadecimal character entity references.

Assuming PEP 293 is accepted, I'd like to see the asciihtml 
codec stay for its decoding ability and be renamed to xmlcharref. 
The encoding part of this codec can just call .encode("ascii", 
errors="xmlcharrefreplace") to make it a full two-way codec.
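
A rough sketch of what such a two-way pair could look like, assuming
modern Python 3. The function names xmlcharref_encode and
xmlcharref_decode are hypothetical, and the decoder handles only decimal
and hexadecimal references, not named entities.

```python
import re

# Matches &#233; (decimal) and &#xE9; / &#Xe9; (hexadecimal) references.
_CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def xmlcharref_encode(text):
    """Encode to ASCII, replacing other characters with &#NNN; references."""
    return text.encode("ascii", errors="xmlcharrefreplace")


def xmlcharref_decode(data):
    """Decode ASCII bytes and expand numeric character references."""
    def expand(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Out-of-range code point: leave the reference as written.
            return match.group(0)

    return _CHARREF.sub(expand, data.decode("ascii"))
```

Note that text which already contains a literal sequence such as "&#233;"
does not round-trip, because the encoder leaves "&" alone.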

I'd prefer htmlentitydefs.py to use unicode, too. It's not so 
useful the way it is.  Another problem is that it uses mixed 
case names as keys. The dictionary lookup is likely to miss 
incoming entities with arbitrary case so it's more-or-less 
broken. Does anyone actually use it the way it is? Can it be 
changed to use unicode without breaking anyone's code?
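
For reference, a small sketch of how lookups by case behave, assuming
modern Python 3, where the htmlentitydefs tables live in html.entities.
In the HTML 4 entity set the two spellings below name different
characters, and an all-uppercase spelling is simply absent.

```python
from html.entities import name2codepoint

upper = name2codepoint["Eacute"]   # LATIN CAPITAL LETTER E WITH ACUTE
lower = name2codepoint["eacute"]   # LATIN SMALL LETTER E WITH ACUTE

# A spelling with arbitrary case misses the table entirely.
shouting = name2codepoint.get("EACUTE")
```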

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-08-04 08:54

Message:
Logged In: YES 
user_id=21627

This patch is superseded by PEP 293 and patch #432401, which
allows you to write

unitext.encode("ascii", errors = "xmlcharrefreplace")

This probably should be left open until PEP 293 is
pronounced upon, and then either rejected or reviewed in detail.

I'd encourage a patch that uses Unicode in htmlentitydefs
directly, and computes entitydefs from that, instead of
vice-versa (or at least exposes a unicode_entitydefs, perhaps
even lazily) - perhaps also with a reverse mapping.
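
One way the suggestion could be sketched, assuming modern Python 3 and
its html.entities.name2codepoint table. The names unicode_entitydefs and
reverse_entitydefs are hypothetical here; lru_cache stands in for the
"lazy" computation, so each table is built on first use only.

```python
from functools import lru_cache
from html.entities import name2codepoint


@lru_cache(maxsize=None)
def unicode_entitydefs():
    """Map entity name -> one-character string, built on first call."""
    return {name: chr(cp) for name, cp in name2codepoint.items()}


@lru_cache(maxsize=None)
def reverse_entitydefs():
    """Map character -> entity name, built on first call."""
    forward = unicode_entitydefs()
    reverse = {}
    # Sorted order makes the choice deterministic if two names ever
    # shared a character.
    for name in sorted(forward):
        reverse.setdefault(forward[name], name)
    return reverse
```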

----------------------------------------------------------------------
