unicode "table of character" implementation in python

"Martin v. Löwis" martin at v.loewis.de
Mon Aug 28 12:36:06 EDT 2006


Nicolas Pontoizeau wrote:
> I am handling a mixed-language text file encoded in UTF-8. There is
> mainly French, English and Asian languages. I need to detect every
> Asian character in order to enclose it in a special tag for LaTeX.
> Does anybody know if there is a Unicode "table of characters"
> implementation in Python? I mean, I give a character and Python replies
> with the language in which the character occurs.

This is a bit underspecified, so most likely nothing that already exists
will fit your needs exactly. If you need to escape characters for LaTeX,
I would expect there to be a more precise specification of what needs
escaping; I doubt that whether a character is used primarily in Asia
matters much to LaTeX.

In any case, somebody already pointed you to the Unicode code blocks.
I think these are the blocks for Asian scripts (I may have missed some):

0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0750..077F; Arabic Supplement
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1780..17FF; Khmer
1800..18AF; Mongolian
1900..194F; Limbu
1950..197F; Tai Le
1980..19DF; New Tai Lue
19E0..19FF; Khmer Symbols
2D00..2D2F; Georgian Supplement
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31C0..31EF; CJK Strokes
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7AF; Hangul Syllables
F900..FAFF; CJK Compatibility Ideographs
FB50..FDFF; Arabic Presentation Forms-A
FE30..FE4F; CJK Compatibility Forms
FE70..FEFF; Arabic Presentation Forms-B
20000..2A6DF; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement
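
A table like this can be turned into a lookup fairly directly. The
following is only a minimal sketch, assuming Python 3 (where str is
Unicode); the names BLOCKS and block_of are made up for illustration,
and only a handful of the ranges above are copied in:

```python
from bisect import bisect_right

# A few of the ranges listed above, as (first, last, block name).
# The list must stay sorted by first code point; add the other
# ranges from the table as needed.
BLOCKS = [
    (0x0E00, 0x0E7F, "Thai"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
]
_STARTS = [first for first, _, _ in BLOCKS]

def block_of(ch):
    """Return the block name for a one-character string, or None."""
    cp = ord(ch)
    # Find the last range whose first code point is <= cp.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        first, last, name = BLOCKS[i]
        if cp <= last:
            return name
    return None
```

Note that this reports a block, not a language: a character in
"CJK Unified Ideographs" may be Chinese, Japanese or Korean text.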

Note that some scripts are used both in Asia and elsewhere, e.g. Latin
and Cyrillic, so they are not listed. Arabic probably doesn't belong in
this list either, since it is the script of the official language in
countries both in Asia and elsewhere.
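
For the tagging itself, one possible approach is to build a regular
expression character class from whichever ranges you decide on and wrap
each run of matching characters. This is a rough sketch, assuming
Python 3; RANGES is a small illustrative subset, wrap_runs is a
made-up name, and \asian is a placeholder macro that you would have to
define yourself in LaTeX:

```python
import re

# Illustrative subset of code point ranges (first, last); adjust to
# the scripts that actually need the tag.
RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]

# None of these code points is special inside [...], so the range
# endpoints can be placed in the character class unescaped.
_CLASS = "".join("%s-%s" % (chr(a), chr(b)) for a, b in RANGES)
_RUN = re.compile("[%s]+" % _CLASS)

def wrap_runs(text, macro="asian"):
    """Enclose each run of matching characters in \\macro{...}."""
    # A function is used as the replacement so that backslashes in
    # the output are not interpreted as regex escapes.
    return _RUN.sub(lambda m: "\\%s{%s}" % (macro, m.group(0)), text)
```

Wrapping whole runs rather than single characters keeps the output
shorter; characters outside the chosen ranges (including spaces and
Latin punctuation) end a run.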

Regards,
Martin


