[Patches] [ python-Patches-590682 ] New codecs: html, asciihtml
noreply@sourceforge.net
Sun, 04 Aug 2002 04:10:22 -0700
Patches item #590682, was opened at 2002-08-04 04:58
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=590682&group_id=5470
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Oren Tirosh (orenti)
Assigned to: Nobody/Anonymous (nobody)
Summary: New codecs: html, asciihtml
Initial Comment:
These codecs translate HTML character &entity;
references.
The html codec may be applied after other codecs such
as utf-8 or iso8859_X and preserves their encoding. The
asciihtml encoder produces 7-bit ascii and its output is
therefore safe for insertion into almost any document
regardless of its encoding.
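
A rough illustration of the two behaviours described above, written for modern Python 3 rather than taken from the patch. The helper names html_encode and asciihtml_encode are hypothetical, and the sketch escapes markup before encoding instead of chaining real codecs, which gives the same bytes for ASCII-compatible encodings:

```python
import html


def html_encode(text, encoding="utf-8"):
    """Hypothetical stand-in for the proposed "html" codec.

    Only <, > and & are turned into entity references; every other
    character is left to the chosen encoding.
    """
    return html.escape(text, quote=False).encode(encoding)


def asciihtml_encode(text):
    """Hypothetical stand-in for the proposed "asciihtml" encoder.

    Markup characters become named entities, and anything outside
    7-bit ASCII becomes a decimal character reference.
    """
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", errors="xmlcharrefreplace")


sample = "a<b \u00e9"
print(html_encode(sample))       # UTF-8 bytes with only the markup escaped
print(asciihtml_encode(sample))  # pure ASCII output
```

Because the second result contains no byte above 0x7F, it could be pasted into a document in any ASCII-compatible encoding.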
----------------------------------------------------------------------
>Comment By: Oren Tirosh (orenti)
Date: 2002-08-04 11:10
Message:
Logged In: YES
user_id=562624
PEP 293 and patch #432401 are not a replacement for these
codecs: the codecs do decoding as well as encoding, and also
translate <, >, and &, which are valid in all encodings and
therefore won't get translated by error callbacks.
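
The point about error callbacks can be checked with a short sketch in modern Python 3, where the "xmlcharrefreplace" handler from PEP 293 is available. The sample string is made up for illustration:

```python
# The error handler is only invoked for characters the target encoding
# cannot represent, so ASCII markup characters pass through unchanged.
text = "1 < 2 & caf\u00e9"
encoded = text.encode("ascii", errors="xmlcharrefreplace")

print(encoded)
# The non-ASCII character is replaced by a character reference,
# while "<" and "&" are still present as literal bytes.
```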
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-04 11:00
Message:
Logged In: YES
user_id=562624
Yes, the error callback approach handles strange mixes
better than my method of chaining codecs. But it only does
encoding - this patch also provides full decoding of named,
decimal and hexadecimal character entity references.
Assuming PEP 293 is accepted, I'd like to see the asciihtml
codec stay for its decoding ability and renamed to xmlcharref.
The encoding part of this codec can just call .encode("ascii",
errors="xmlcharrefreplace") to make it a full two-way codec.
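
A minimal sketch of such a two-way pairing, assuming modern Python 3. The function names are hypothetical, and html.unescape is used here as a convenient stand-in for the patch's own decoder, not as a description of how the patch works:

```python
import html


def xmlcharref_encode(text):
    # Encoding half, exactly as suggested above: delegate to the
    # "xmlcharrefreplace" error handler.
    return text.encode("ascii", errors="xmlcharrefreplace")


def xmlcharref_decode(data):
    # Decoding half: html.unescape resolves named, decimal and
    # hexadecimal character references.
    return html.unescape(data.decode("ascii"))


print(xmlcharref_decode(b"&eacute; &#233; &#xE9;"))
print(xmlcharref_decode(xmlcharref_encode("na\u00efve")))
```

Note that the round trip is only exact for text that contains no literal "&" sequences resembling references, since this encoder does not escape "&".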
I'd prefer htmlentitydefs.py to use unicode, too. It's not so
useful the way it is. Another problem is that it uses mixed
case names as keys. The dictionary lookup is likely to miss
incoming entities with arbitrary case so it's more-or-less
broken. Does anyone actually use it the way it is? Can it be
changed to use unicode without breaking anyone's code?
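
The case question can be explored with a small sketch. It uses html.entities from modern Python 3 (the successor of htmlentitydefs), so it illustrates the lookup behaviour and is not a test of the 2002 module. Entity names in HTML 4 are case-sensitive, which is why the table distinguishes some names only by case:

```python
from html.entities import name2codepoint

# An entity written in arbitrary case misses the dictionary lookup.
print("EACUTE" in name2codepoint)

# Folding case is not a safe fix, because some names differ only by
# case and denote different characters.
upper = chr(name2codepoint["Aacute"])
lower = chr(name2codepoint["aacute"])
print(upper, lower)
```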
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-08-04 08:54
Message:
Logged In: YES
user_id=21627
This patch is superseded by PEP 293 and patch #432401, which
allow you to write
unitext.encode("ascii", errors = "xmlcharrefreplace")
This probably should be left open until PEP 293 is
pronounced upon, and then either rejected or reviewed in detail.
I'd encourage a patch that uses Unicode in htmlentitydefs
directly, and computes entitydefs from that, instead of
vice-versa (or at least exposes a unicode_entitydefs, perhaps
even lazily) - perhaps also with a reverse mapping.
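
One way the suggested layout could look, sketched in modern Python 3. The names unicode_entitydefs, reverse_entitydefs and legacy_entitydefs are hypothetical, and html.entities.name2codepoint stands in for whatever Unicode table the module would define:

```python
from html.entities import name2codepoint

# Unicode-first table: entity name -> one-character string.
unicode_entitydefs = {name: chr(cp) for name, cp in name2codepoint.items()}

# Reverse mapping: character -> entity name.
reverse_entitydefs = {char: name for name, char in unicode_entitydefs.items()}


def legacy_value(char):
    """Derive an old-style value from the Unicode character.

    Assumption: the legacy table holds a Latin-1 byte where one exists
    and a decimal character reference otherwise.
    """
    try:
        return char.encode("latin-1")
    except UnicodeEncodeError:
        return b"&#%d;" % ord(char)


# The byte-oriented table is computed from the Unicode one,
# instead of the other way round.
legacy_entitydefs = {
    name: legacy_value(char) for name, char in unicode_entitydefs.items()
}

print(reverse_entitydefs["\u00e9"], legacy_entitydefs["euro"])
```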
----------------------------------------------------------------------