How to get the ascii code of Chinese characters?

Peter Maas peter.maas at somewhere.com
Sat Aug 19 15:54:36 EDT 2006


Gerhard Fiedler wrote:
> Well, ASCII can represent the Unicode numerically -- if that is what the OP
> wants.

No. The ASCII character range is 0..127, while the Unicode character
range is at least 0..65535 (the full code space runs up to 0x10FFFF).
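
To make the difference in ranges concrete, here is a small sketch. It
assumes Python 3, where every str is already a Unicode string; the
sample characters are my own choice and are not from the thread.

```python
import sys

# ord() returns the Unicode number (code point) of a one-character string.
ascii_code = ord("A")        # 65, inside the ASCII range 0..127
hanzi_code = ord("\u4e2d")   # 20013 (0x4E2D), far outside the ASCII range

print(ascii_code, hanzi_code)
print(hanzi_code > 127)      # True
print(hex(sys.maxunicode))   # 0x10ffff on Python 3.3 and later
```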

> For example, "U+81EC" (all ASCII) is one possible -- not very
> readable though <g> -- representation of a Hanzi character (see
> http://www.cojak.org/index.php?function=code_lookup&term=81EC).

U+81EC denotes the Unicode character whose code point is the number
0x81EC. Several encodings are defined that map Unicode strings to
byte sequences: UTF-8 maps them to sequences of bytes in the range
0..255, UTF-7 to sequences of bytes in the range 0..127. You *could*
read the latter as ASCII sequences, but that would not be correct:
the bytes only look like ASCII text.
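
A quick way to see the two byte ranges for the character from the
quoted example. This is only a sketch and assumes Python 3; the
variable names are mine.

```python
ch = "\u81ec"                 # the character U+81EC from the quoted example
code_point = ord(ch)          # 0x81EC

utf8_bytes = ch.encode("utf-8")
utf7_bytes = ch.encode("utf-7")

print(hex(code_point))        # 0x81ec
print(list(utf8_bytes))       # [232, 135, 172], values above 127 occur
print(max(utf7_bytes) < 128)  # True, every UTF-7 byte fits in 7 bits
```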

How to do it in Python? Let chinesePhrase be a Unicode string with
Chinese content. Then

chinesePhrase_7bit = chinesePhrase.encode('utf-7')

will produce a sequence of bytes in the range 0..127 representing
chinesePhrase and *looking like* a (meaningless) ASCII sequence.
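
Spelled out as a runnable sketch, assuming Python 3 and a sample
phrase of my own choosing (the OP's actual text is not known):

```python
# Hypothetical sample phrase: the two Hanzi U+4E2D U+6587.
chinesePhrase = "\u4e2d\u6587"

chinesePhrase_7bit = chinesePhrase.encode("utf-7")

# Every byte fits in 7 bits, so decoding as ASCII does not fail ...
all_7bit = all(b < 128 for b in chinesePhrase_7bit)
as_ascii_text = chinesePhrase_7bit.decode("ascii")

# ... but only decoding as UTF-7 gives the original phrase back.
restored = chinesePhrase_7bit.decode("utf-7")

print(all_7bit)                   # True
print(restored == chinesePhrase)  # True
print(as_ascii_text)              # ASCII-looking text, not the phrase
```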

chinesePhrase_16bit = chinesePhrase.encode('utf-16be')

will produce a sequence with Unicode numbers packed into a byte
string in big-endian order (two bytes per character for characters
up to U+FFFF). This is probably closest to what the OP wants.
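
If the numbers themselves are wanted, the packed bytes can be
unpacked again. The sketch below assumes Python 3 and the same
hypothetical sample phrase; using struct here is my choice, and
calling ord() on each character gives the same result for characters
up to U+FFFF. Characters beyond U+FFFF would show up as two 16-bit
surrogate values instead of one number.

```python
import struct

# Hypothetical sample phrase: the two Hanzi U+4E2D U+6587.
chinesePhrase = "\u4e2d\u6587"

chinesePhrase_16bit = chinesePhrase.encode("utf-16be")

# Unpack the byte string as big-endian unsigned 16-bit integers.
count = len(chinesePhrase_16bit) // 2
numbers = struct.unpack(">%dH" % count, chinesePhrase_16bit)

# The same numbers, taken directly from the characters.
direct = [ord(c) for c in chinesePhrase]

print([hex(n) for n in numbers])  # ['0x4e2d', '0x6587']
print(list(numbers) == direct)    # True
```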

Peter Maas, Aachen


