Convert from unicode chars to HTML entities

Mon Jan 29 00:40:06 EST 2007

Steven D'Aprano wrote:
> A few issues:
> 
> (1) It doesn't seem to be reversible:
> 
> >>> '&copy; and many more...'.decode('latin-1')
> u'&copy; and many more...'
> 
> What should I do instead?

Unfortunately, there's nothing in the standard library that can do that, 
as far as I know. You'll have to write your own function. Here's one 
I've used before (partially stolen from code in Python patch #912410 
which was written by Aaron Swartz):

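from htmlentitydefs import name2codepoint
import re

def _replace_entity(m):
    # m.group(1) is the text between '&' and ';'.
    s = m.group(1)
    if s[0] == u'#':
        # Numeric character reference: decimal (&#169;) or hex (&#xA9;).
        s = s[1:]
        try:
            if s[0] in u'xX':
                c = int(s[1:], 16)
            else:
                c = int(s)
            return unichr(c)
        except (ValueError, OverflowError):
            # Malformed or out-of-range number: leave the text untouched.
            return m.group(0)
    else:
        # Named entity such as &copy;; unknown names are left untouched.
        try:
            return unichr(name2codepoint[s])
        except (ValueError, KeyError):
            return m.group(0)

_entity_re = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")

def unescape(s):
    # Replace every entity or character reference in s with its character.
    return _entity_re.sub(_replace_entity, s)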
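
A quick round-trip check, as a sketch only. It restates the function 
above in compact form (one try block instead of two) so that the snippet 
runs on its own. The fallback import and the unichr = chr alias are my 
assumptions for interpreters where htmlentitydefs is not available; the 
sample strings are made up for illustration.

```python
import re

try:
    from htmlentitydefs import name2codepoint
except ImportError:
    # Assumption: on newer interpreters the same table lives in
    # html.entities, and chr() does the job of unichr().
    from html.entities import name2codepoint
    unichr = chr

_entity_re = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")

def _replace_entity(m):
    s = m.group(1)
    try:
        if s.startswith(u'#'):
            digits = s[1:]
            if digits[:1] in (u'x', u'X'):
                return unichr(int(digits[1:], 16))   # hex reference
            return unichr(int(digits))               # decimal reference
        return unichr(name2codepoint[s])             # named entity
    except (ValueError, OverflowError, KeyError):
        # Anything malformed or unknown is left exactly as it was.
        return m.group(0)

def unescape(s):
    return _entity_re.sub(_replace_entity, s)

original = u'\xa9 and many more...'

# Forward direction: non-ASCII characters become numeric references.
encoded = original.encode('ascii', 'xmlcharrefreplace')

# Reverse direction: decode the bytes, then expand the references.
decoded = unescape(encoded.decode('ascii'))

print(repr(encoded))
print(decoded == original)
```

Note that the reverse direction is not a codec: you decode the bytes 
first and then run unescape() on the resulting unicode string.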

> (2) Are XML entities guaranteed to be the same as HTML entities?

XML defines one entity which doesn't exist in HTML: &apos;. But 
xmlcharrefreplace only generates numeric character references, and those 
should be the same between XML and HTML.
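
To illustrate the point about numeric references, here is a small 
sketch. The sample text is my own; the only thing it relies on is the 
standard 'xmlcharrefreplace' error handler.

```python
# Three non-ASCII characters: e-acute, the copyright sign, the euro sign.
text = u'caf\xe9 \xa9 \u20ac'

# Each character that ASCII cannot represent is replaced by a decimal
# numeric character reference; no entity names are ever produced.
encoded = text.encode('ascii', 'xmlcharrefreplace')

print(repr(encoded))
```

The output contains &#233;, &#169; and &#8364; rather than &eacute;, 
&copy; and &euro;, so it does not depend on which named entities the 
consumer knows about.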

> (3) Is there a way to find out at runtime what encoders/decoders/error
> handlers are available, and what they do? 

From what I remember, that's not possible: the codec system is designed 
so that search functions, which take a name and return a codec, are 
registered instead of the names themselves, so there is no list of names 
to enumerate. But all of the standard codecs are documented at 
<http://python.org/doc/current/lib/standard-encodings.html>, and all of 
the standard error handlers are documented at 
<http://python.org/doc/current/lib/codec-base-classes.html>.
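
What you can do at runtime is probe for one particular name. A minimal 
sketch, using codecs.lookup() and codecs.lookup_error(), both of which 
raise LookupError for an unknown name; the helper names have_codec and 
have_error_handler are made up for this example.

```python
import codecs

def have_codec(name):
    """Return True if some registered search function knows this codec."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

def have_error_handler(name):
    """Return True if an error handler is registered under this name."""
    try:
        codecs.lookup_error(name)
        return True
    except LookupError:
        return False

print(have_codec('latin-1'))
print(have_codec('no-such-codec'))
print(have_error_handler('xmlcharrefreplace'))
print(have_error_handler('no-such-handler'))
```

This only answers "does this name work?"; it doesn't tell you what the 
codec or handler does, so the documentation is still the place to look 
for that.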