converting html escape sequences to unicode characters

Fri Dec 10 03:09:44 EST 2004

On Fri, 2004-12-10 at 08:36, harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8.  Stuff like:

I'm pretty sure this somewhat horrifying code does it, but it's probably
an example of what not to do:

>>> escapeseq = '&#48708;'
>>> uescape = ("\\u%04x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비
(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.
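
The same decimal-to-character step can also be written without going
through an escape string at all. A minimal sketch, assuming Python 3
(where chr() accepts any Unicode code point; in the Python 2 used in this
session the rough equivalent is unichr()). The name unescape_decimal is
made up for illustration:

```python
def unescape_decimal(escapeseq):
    """Convert one decimal HTML character reference, e.g. '&#48708;', to a str."""
    # Drop the leading '&#' and trailing ';', then map the code point to a character.
    return chr(int(escapeseq[2:-1]))

char = unescape_decimal('&#48708;')
print(ascii(char))           # '\ube44'
print(char.encode('utf-8'))  # b'\xeb\xb9\x84'
```

This avoids building and re-parsing a "\uXXXX" string, so the number of
hex digits in the code point no longer matters.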

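>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
'&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
'&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
'&#51104;']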
>>> def unescape(escapeseq):
...     return ("\\u%04x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])
비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠
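
For the original question (a list of decimal escapes that should end up
as UTF-8), a regex substitution also handles references embedded in
longer text. This is only a sketch under the same Python 3 assumption;
decode_decimal_refs and DECIMAL_REF are hypothetical names, and the
pattern deliberately ignores hex (&#x...;) and named (&amp;) references.
The standard library's html.unescape() covers those cases as well.

```python
import re

# Matches decimal numeric character references such as '&#48708;'.
DECIMAL_REF = re.compile(r'&#(\d+);')

def decode_decimal_refs(text):
    """Replace every decimal character reference in text with its character."""
    return DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)

entities = ['&#48708;', '&#54665;', '&#44592;']
decoded = ' '.join(decode_decimal_refs(x) for x in entities)
utf8_bytes = decoded.encode('utf-8')  # bytes, ready to write to a file
print(ascii(decoded))  # '\ube44 \ud589 \uae30'
```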

--
Craig Ringer