converting html escape sequences to unicode characters
Craig Ringer
craig at postnewspapers.com.au
Fri Dec 10 03:09:44 EST 2004
On Fri, 2004-12-10 at 08:36, harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8. Stuff like:
I'm pretty sure this somewhat horrifying code does it, but it's probably
an example of what not to do:
>>> escapeseq = '&#48708;'
>>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비
(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).
I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.
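
The same steps can be done without the round trip through an escaped string. This is only a minimal sketch, assuming Python 3 (the session above is Python 2) and a hypothetical helper name `entity_to_char` that is not from the original post. It passes the decimal value straight to `chr()`. That also avoids a limit of the `"\\u%x"` approach: a `\u` escape needs exactly four hex digits, so the formatted string is only valid for code points from U+1000 to U+FFFF.

```python
def entity_to_char(escapeseq):
    """Convert one decimal HTML escape such as '&#48708;' to a character.

    Hypothetical helper for illustration; assumes the input always has
    the form '&#<decimal digits>;'.
    """
    # Drop the leading '&#' and the trailing ';' to leave the decimal digits.
    codepoint = int(escapeseq[2:-1])
    # In Python 3, chr() maps a code point directly to a one-character str.
    return chr(codepoint)


# ascii() shows the result as an escape, so no font or terminal encoding
# is needed to check it.
print(ascii(entity_to_char('&#48708;')))  # '\ube44'
```

Short values such as `'&#65;'` work here as well, since no hex escape is built along the way.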
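>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
'&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
'&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
'&#51104;']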
>>> def unescape(escapeseq):
... return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])
비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠
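
Since the question asked for UTF-8, here is a hedged sketch of the whole conversion for a run of text rather than one entity at a time. It assumes Python 3.4 or later, where the standard library provides `html.unescape()`. The sample string is just the first three entities from the list above joined together; the variable names are illustrative only.

```python
import html

# Sample input: the first three entities from the list above, joined.
text = '&#48708;&#54665;&#44592;'

# html.unescape() replaces numeric and named character references
# with the characters they stand for.
decoded = html.unescape(text)

# Encoding the resulting str produces the UTF-8 bytes.
utf8_bytes = decoded.encode('utf-8')

print(utf8_bytes)  # b'\xeb\xb9\x84\xed\x96\x89\xea\xb8\xb0'
```

Printing the bytes rather than the decoded string keeps the check independent of the terminal's font and encoding.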
--
Craig Ringer