converting html escape sequences to unicode characters

Craig Ringer craig at postnewspapers.com.au
Fri Dec 10 03:09:44 EST 2004


On Fri, 2004-12-10 at 08:36, harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8.  Stuff like:

I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:

>>> escapeseq = '비'
>>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비
(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.

>>> entities = ['비', '행', '기', '로',
'보', '낼', '거', '에', '요', '내',
'면', '금', '이', '얼', '마', '지',
'잠']
>>> def unescape(escapeseq):
...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])
비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠

--
Craig Ringer




More information about the Python-list mailing list