Convert from unicode chars to HTML entities
Leif K-Brooks
eurleif at ecritters.biz
Mon Jan 29 00:40:06 EST 2007
Steven D'Aprano wrote:
> A few issues:
>
> (1) It doesn't seem to be reversible:
>
> >>> '&#169; and many more...'.decode('latin-1')
> u'&#169; and many more...'
>
> What should I do instead?
Unfortunately, there's nothing in the standard library that can do that,
as far as I know. You'll have to write your own function. Here's one
I've used before (partially stolen from code in Python patch #912410
which was written by Aaron Swartz):
from htmlentitydefs import name2codepoint
import re

def _replace_entity(m):
    # m.group(1) is the text between '&' and ';'
    s = m.group(1)
    if s[0] == u'#':
        # Numeric character reference: decimal (&#169;) or hex (&#xA9;)
        s = s[1:]
        try:
            if s[0] in u'xX':
                c = int(s[1:], 16)
            else:
                c = int(s)
            return unichr(c)
        except ValueError:
            # Not a valid number or code point; leave the text as it was
            return m.group(0)
    else:
        # Named entity such as &copy;
        try:
            return unichr(name2codepoint[s])
        except (ValueError, KeyError):
            return m.group(0)

_entity_re = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")

def unescape(s):
    return _entity_re.sub(_replace_entity, s)
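For anyone reading this on Python 3, here is a rough port of the same idea, shown as a sketch rather than a drop-in replacement. It assumes that `htmlentitydefs` has become `html.entities` and that `chr` stands in for `unichr`; the round trip at the end uses the sample string from the question.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

_entity_re = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")

def _replace_entity(m):
    s = m.group(1)
    try:
        if s.startswith("#"):
            s = s[1:]
            # Hex references look like &#xA9;, decimal ones like &#169;.
            code = int(s[1:], 16) if s[:1] in ("x", "X") else int(s)
        else:
            code = name2codepoint[s]
        return chr(code)
    except (ValueError, KeyError, OverflowError):
        # Unknown name or unusable number: leave the text untouched.
        return m.group(0)

def unescape(s):
    return _entity_re.sub(_replace_entity, s)

# Round trip: escape with xmlcharrefreplace, then undo it.
escaped = "© and many more...".encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)
print(unescape(escaped))
```

Python 3.4 and later also ship `html.unescape` in the standard library, which covers the same ground but follows the HTML5 rules, so its results can differ from this function on edge cases.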
> (2) Are XML entities guaranteed to be the same as HTML entities?
XML defines one entity which doesn't exist in HTML: &apos;. But
xmlcharrefreplace only generates numeric character references, and those
should be the same between XML and HTML.
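A quick way to check both points, written for Python 3 as an illustration only. It assumes the HTML 4 table in `html.entities.name2codepoint` is a fair stand-in for "HTML entities" here; the sample text is made up.

```python
from html.entities import name2codepoint, html5

text = "© café"  # made-up sample text

# xmlcharrefreplace emits only numeric references, never names like &copy;
encoded = text.encode("ascii", "xmlcharrefreplace")
print(encoded)

# &apos; is missing from the HTML 4 table but present in the HTML5 one
has_apos_html4 = "apos" in name2codepoint
has_apos_html5 = "apos;" in html5
print(has_apos_html4, has_apos_html5)
```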
> (3) Is there a way to find out at runtime what encoders/decoders/error
> handlers are available, and what they do?
From what I remember, that's not possible: the codec registry stores
search functions that take a name and return a codec, rather than the
names themselves, so there's no list to enumerate. But all of the standard codecs are documented at
<http://python.org/doc/current/lib/standard-encodings.html>, and all of
the standard error handlers are documented at
<http://python.org/doc/current/lib/codec-base-classes.html>.
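What you can do at runtime is probe for a particular name. The sketch below uses `codecs.lookup` and `codecs.lookup_error`, both of which raise LookupError for unknown names; the helper names `describe_encoding` and `has_error_handler` are made up for this example.

```python
import codecs

def describe_encoding(name):
    """Return the canonical codec name, or None if no codec is found."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None

def has_error_handler(name):
    """Return True if an error handler is registered under this name."""
    try:
        codecs.lookup_error(name)
        return True
    except LookupError:
        return False

print(describe_encoding("latin-1"))
print(has_error_handler("xmlcharrefreplace"))
```

This only answers "does this name work?", not "which names exist?", so the documentation is still the place to look for the full list and for what each handler does.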