Convert from unicode chars to HTML entities

Sun Jan 28 23:38:39 EST 2007

On Mon, 29 Jan 2007 00:05:24 -0300, Steven D'Aprano  
<steve at REMOVEME.cybersource.com.au> wrote:

> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."
>

The htmlentitydefs module contains the tables you're looking for, but they  
need a few transformations first:

<code>
# -*- coding: iso-8859-15 -*-
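from htmlentitydefs import codepoint2name

# Map each character to its named entity, e.g. u'\xa9' -> u'&copy;'.
unichr2entity = dict((unichr(code), u'&%s;' % name)
    for code, name in codepoint2name.iteritems()
    if code != 38)  # exclude "&"; htmlescape() handles it first

def htmlescape(text, d=unichr2entity):
    # Escape "&" before anything else, so the ampersands introduced
    # by the entities below are not escaped a second time.
    if u"&" in text:
        text = text.replace(u"&", u"&amp;")
    for key, value in d.iteritems():
        if key in text:
            text = text.replace(key, value)
    return text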

print '%r' % htmlescape(u'hello')
print '%r' % htmlescape(u'"©® áé&ö <²³>')
</code>

Output:
u'hello'
u'&quot;&copy;&reg; &aacute;&eacute;&amp;&ouml; &lt;&sup2;&sup3;&gt;'

The result is a unicode object with every known entity replaced. Characters  
that have no entry in the table are left unchanged; as the docs for  
htmlentitydefs say, "the definition provided here contains all the entities  
defined by XHTML 1.0 that can be handled using simple textual substitution  
in the Latin-1 character set (ISO-8859-1)."
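
If you also need to cover characters the table lacks, here is a minimal  
sketch of one way to do it. It assumes Python 3, where htmlentitydefs was  
renamed html.entities, rather than the Python 2 used elsewhere in this  
message. The name htmlescape_all is made up for this illustration. It falls  
back to a numeric character reference for any non-ASCII character that has  
no named entity:

```python
# Assumption: Python 3, where htmlentitydefs is available as html.entities.
from html.entities import codepoint2name


def htmlescape_all(text):
    """Replace characters with named entities, or numeric references
    when no named entity exists (hypothetical helper, not a library API)."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # Known entity, including &amp; &lt; &gt; &quot;
            pieces.append("&%s;" % codepoint2name[code])
        elif code > 127:
            # No named entity: use a decimal character reference.
            pieces.append("&#%d;" % code)
        else:
            # Plain ASCII is passed through untouched.
            pieces.append(ch)
    return "".join(pieces)


print(htmlescape_all("\u00a9 \u0416"))  # prints: &copy; &#1046;
```

Because the text is scanned once, character by character, "&" needs no  
special ordering here: the ampersands that belong to generated entities are  
never looked at again.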

-- 
Gabriel Genellina