Converting HTML escape sequences to Unicode characters

Thu Dec 9 20:27:32 EST 2004

harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8.  Stuff like:
> 
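> &#48708;
> &#54665;
> &#44592;
> &#47196;
> &#48372;
> &#45244;
> &#44144;
> &#50640;
> &#50836;
> &#45236;
> &#47732;
> &#44552;
> &#51060;
> &#50620;
> &#47560;
> &#51648;
> &#51104;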
> 
> Anyone know what the decimal is representing?  It doesn't seem to
> equate to a unicode codepoint...

In well-formed HTML (!) these are numeric character references: the number is the decimal value of a Unicode code point. See
http://www.w3.org/TR/html4/charset.html#h-5.3.1
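
As a quick check of that reading, here is a minimal sketch that decodes decimal references by hand and compares the result with the standard library's html.unescape. It assumes Python 3.4 or later; the helper name decode_decimal_refs and the sample string are only illustrative, not part of the original list.

```python
import html
import re

def decode_decimal_refs(text):
    """Replace each decimal reference such as &#48708; with its character."""
    # The digits between '&#' and ';' are read as a decimal code point.
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)

sample = '&#48708;&#54665;&#44592;'   # illustrative input
decoded = decode_decimal_refs(sample)

print(decoded)
# The hand-rolled decoder and the standard library should agree.
print(decoded == html.unescape(sample))
```

If the two results agree, the numbers really are plain decimal code points and html.unescape alone is enough for the conversion.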

These characters appear to be Hangul Syllables:
http://www.unicode.org/charts/PDF/UAC00.pdf

import unicodedata

nums = [
     48708,
     54665,
     44592,
     47196,
     48372,
     45244,
     44144,
     50640,
     50836,
     45236,
     47732,
     44552,
     51060,
     50620,
     47560,
     51648,
     51104,
]

# Look up the official Unicode name of each code point (Python 3 syntax).
for num in nums:
    print(num, unicodedata.name(chr(num), 'Unknown'))

=>
48708 HANGUL SYLLABLE BI
54665 HANGUL SYLLABLE HAENG
44592 HANGUL SYLLABLE GI
47196 HANGUL SYLLABLE RO
48372 HANGUL SYLLABLE BO
45244 HANGUL SYLLABLE NAEL
44144 HANGUL SYLLABLE GEO
50640 HANGUL SYLLABLE E
50836 HANGUL SYLLABLE YO
45236 HANGUL SYLLABLE NAE
47732 HANGUL SYLLABLE MYEON
44552 HANGUL SYLLABLE GEUM
51060 HANGUL SYLLABLE I
50620 HANGUL SYLLABLE EOL
47560 HANGUL SYLLABLE MA
51648 HANGUL SYLLABLE JI
51104 HANGUL SYLLABLE JAM
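
Since the original goal was UTF-8, the remaining step can be sketched as follows, assuming Python 3 (where chr accepts any code point and str.encode produces bytes). Only the first three values are used to keep it short, and the name short_nums is introduced here purely for illustration.

```python
# First three code points from the list above (illustrative subset).
short_nums = [48708, 54665, 44592]

# Build a Unicode string from the code points, then encode it as UTF-8.
text = ''.join(chr(n) for n in short_nums)
utf8_bytes = text.encode('utf-8')

print(text)
print(utf8_bytes)
# Each Hangul syllable occupies three bytes in UTF-8.
print(len(utf8_bytes))
```

The same two steps (chr, then encode) should apply to the full list of 2500 values; writing the bytes to a file opened in binary mode would give a UTF-8 file.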

Kent