Convert from unicode chars to HTML entities
Gabriel Genellina
gagsl-py at yahoo.com.ar
Sun Jan 28 23:38:39 EST 2007
On Mon, 29 Jan 2007 00:05:24 -0300, Steven D'Aprano
<steve at REMOVEME.cybersource.com.au> wrote:
> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."
>
The htmlentitydefs module contains the tables you're looking for, but they
need a few transformations:
<code>
# -*- coding: iso-8859-15 -*-
from htmlentitydefs import codepoint2name

# Map each character to its named entity, e.g. u'\xa9' -> u'&copy;'.
unichr2entity = dict((unichr(code), u'&%s;' % name)
                     for code, name in codepoint2name.iteritems()
                     if code != 38)  # exclude "&"; it is handled first below

def htmlescape(text, d=unichr2entity):
    # Escape "&" first so the "&" of the entities added below is not re-escaped.
    if u"&" in text:
        text = text.replace(u"&", u"&amp;")
    for key, value in d.iteritems():
        if key in text:
            text = text.replace(key, value)
    return text

print '%r' % htmlescape(u'hello')
print '%r' % htmlescape(u'"©® áé&ö <²³>')
</code>
Output:
u'hello'
u'&quot;&copy;&reg; &aacute;&eacute;&amp;&ouml; &lt;&sup2;&sup3;&gt;'
The result is a unicode object with all known entities replaced. Characters
that have no named entity are left unchanged; as the docs for htmlentitydefs say,
"the definition provided here contains all the entities defined by XHTML
1.0 that can be handled using simple textual substitution in the Latin-1
character set (ISO-8859-1)."
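
For readers on Python 3, here is a minimal sketch of the same idea. It assumes the module's Python 3 name, html.entities (htmlentitydefs was renamed), and it walks the text once, so "&" needs no special ordering. The fallback to numeric character references for non-ASCII characters without a named entity is an addition for illustration, not part of the code above.

```python
from html.entities import codepoint2name

# Map each code point to its named entity, e.g. 169 -> "&copy;".
ENTITY_TABLE = {code: "&%s;" % name for code, name in codepoint2name.items()}


def htmlescape(text):
    """Replace characters with named entities where one exists.

    Non-ASCII characters without a named entity become numeric
    character references (an assumption of this sketch).
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code in ENTITY_TABLE:
            out.append(ENTITY_TABLE[code])
        elif code > 127:
            out.append("&#%d;" % code)
        else:
            out.append(ch)
    return "".join(out)


print(htmlescape("\u00a9 and many more..."))  # &copy; and many more...
print(htmlescape("\u0107"))                   # &#263;
```

Because each character is looked up exactly once, the "&" inside an emitted entity is never escaped a second time.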
--
Gabriel Genellina