Question regarding handling of Unicode data in Devnagari

Sat Sep 12 17:23:11 EDT 2009

"joy99" <subhakolkata1234 at gmail.com> wrote in message 
news:fade868b-6a69-4b74-a8e8-9c28a16174d9 at p10g2000prm.googlegroups.com...
> Dear Group,
>
> As per the standard posted by the UNICODE for the Devnagari script
> used for Hindi and some other languages of India, we have a standard
> set, like from the range of 0900-097F.
> Where, we have numbers for each character:
> like 0904 for Devnagari letter short a, etc.
> Now, if write a program,
>
> where
> ch="0904"
> and I like to see the Devnagari letter short a as output then how
> should I proceed? Can codecs help me or should I use unicodedata?

Here are a number of ways to generate a Unicode character.  Displaying them 
is another matter.  My newsreader program could display them properly, but 
the interactive window in my Python editor could not.

import unicodedata

c = unichr(0x904)
print c,unicodedata.name(c)
print u'\N{DEVANAGARI LETTER SHORT A}'
print u'\u0904'
print u''.join(unichr(c) for c in range(0x900,0x980))

OUTPUT
ऄ DEVANAGARI LETTER SHORT A
ऄ
ऄ
ऀँंःऄअआइईउऊऋऌऍऎएऐऑऒओऔकखगघङचछजझञटठडढणतथदधनऩपफबभमयरऱलळऴवशषसहऺऻ़ऽािीुूृॄॅॆेैॉॊोौ्ॎॏॐ॒॑॓॔ॕॖॗक़ख़ग़ज़ड़ढ़फ़य़ॠॡॢॣ।॥०१२३४५६७८९॰ॱॲॳॴॵॶॷॸॹॺॻॼॽॾॿ
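
The question starts from the string ch="0904" rather than a number, so one 
more step is needed: parse the string as hexadecimal before calling unichr. 
Below is a minimal sketch of that step.  The helper name char_from_hex is 
made up for illustration, and the try/except is only there so the same code 
also runs on Python 3, where chr takes the place of unichr.

```python
import unicodedata

try:
    unichr            # Python 2, as used in the examples above
except NameError:
    unichr = chr      # Python 3: chr() already returns Unicode text


def char_from_hex(hex_string):
    """Turn a hex code point string such as "0904" into one Unicode character.

    Hypothetical helper for illustration; not part of any library.
    """
    return unichr(int(hex_string, 16))


ch = "0904"
c = char_from_hex(ch)

# Print only the ASCII character name, so this works even on a terminal
# that cannot display Devanagari.
print(unicodedata.name(c))
```

This should print DEVANAGARI LETTER SHORT A.  Whether printing c itself 
shows the right glyph still depends on the terminal's encoding and fonts.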

If you use an editor that can write Devanagari and save in an encoding such 
as UTF-8, you can write Devanagari directly in the editor.  You only need to 
tell Python what encoding the source code is in.  To actually see the 
correct characters, you also need a terminal that can display them, and you 
need to know which encoding that terminal uses.  For example, below is a 
program written using Pythonwin from the pywin32 extensions (version 214).  
It can write programs in most encodings and its interactive window supports 
UTF-8.

I can type Chinese and my fonts support it so I'll use that in this example. 
This message is sent in UTF-8 so hopefully it displays properly for you.

# coding: gbk
encoded_text = '你好！你在干什么？'
Unicode_text = u'你好！你在干什么？'
print encoded_text
print encoded_text.decode('gbk')
print Unicode_text
print Unicode_text.encode('utf-8')

OUTPUT:
ţۃáţ՚ىʲôÿ
你好！你在干什么？
你好！你在干什么？
你好！你在干什么？

'encoded_text' is a byte string encoded in the encoding the file is saved in 
(*not* what the #coding line declares...*you* have to make sure they agree!). 
Since my terminal is UTF-8, the gbk-encoded line is garbage.

The 2nd line is correct because decode('gbk') converted the byte string to 
Unicode.  'print' will automatically encode Unicode text in the terminal's 
encoding.  As long as the terminal's encoding and font support the Unicode 
characters used (which in Pythonwin they do), the line will be correct.

The 3rd line works for the same reason the 2nd line does: the string is 
already Unicode.

The 4th line works because it was explicitly encoded into UTF-8, and the 
terminal supports it.
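
The same behaviour can be checked without relying on what a terminal can 
display.  The sketch below is only an illustration.  It uses the first two 
characters of the Chinese text, written as \u escapes so the result does not 
depend on how the file is saved, and it compares the raw bytes instead of 
printing them.  The variable names are made up for this example.

```python
import sys

# First two characters of the example text, as escapes (U+4F60, U+597D).
text = u'\u4f60\u597d'

gbk_bytes = text.encode('gbk')       # 2 bytes per character for this text
utf8_bytes = text.encode('utf-8')    # 3 bytes per character for this text

# Decoding with the matching codec gives back the original Unicode text.
assert gbk_bytes.decode('gbk') == text
assert utf8_bytes.decode('utf-8') == text

# Decoding GBK bytes as UTF-8 fails.  A UTF-8 terminal handed these bytes
# has the same problem, which is why the first output line is garbage.
try:
    gbk_bytes.decode('utf-8')
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(len(gbk_bytes))
print(len(utf8_bytes))
print(mismatch_detected)

# The encoding 'print' will use for Unicode text.  It can be None in
# Python 2 when output is redirected to a file or pipe.
print(sys.stdout.encoding)
```

The byte counts differ (4 for GBK, 6 for UTF-8) even though both byte 
strings stand for the same two characters, so a byte string only makes sense 
together with the name of its encoding.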

I hope this is useful to you.
-Mark