python 2.7 and unicode (one more time)

random832 at fastmail.us
Thu Nov 20 20:31:30 EST 2014


On Thu, Nov 20, 2014, at 20:10, Chris Angelico wrote:
> 2) Languages which use a different alphabet (eg Cyrillic - Russian,
> Bulgarian). You could possibly cram them into an eight-bit encoding
> without tipping ASCII out, but I'm not sure. In Unicode, these
> languages are all easily supported by the BMP, as they don't use a
> huge number of characters each.

There are numerous eight-bit encodings that support Latin and one other
alphabet. Remember, ASCII is a seven-bit encoding, so an eight-bit
encoding is basically two seven-bit encodings: ASCII in the low half
(0x00-0x7F) and 128 more characters in the high half (0x80-0xFF).
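
As a rough illustration of the "two halves" point, here is a small sketch
(Python 3 syntax; the sample string and the choice of three Cyrillic
codecs are just examples picked for this illustration, not anything
from the thread). It checks that each codec stores one byte per
character, leaves the ASCII part untouched, and puts every Cyrillic
letter in the high half.

```python
# Sample text: "Hello, " followed by the Russian word for "hi".
# The string and the codec list are illustrative choices only.
text = u"Hello, \u041f\u0440\u0438\u0432\u0435\u0442"
ascii_part = u"Hello, ".encode("ascii")

encoded = {}
for codec in ("koi8_r", "cp1251", "iso8859_5"):
    data = text.encode(codec)
    encoded[codec] = data

    # One byte per character: Latin and Cyrillic both fit in 8 bits.
    assert len(data) == len(text)

    # The low half is plain ASCII, byte for byte.
    assert data[:len(ascii_part)] == ascii_part

    # Every Cyrillic letter lands in the high half (0x80-0xFF).
    assert all(b >= 0x80 for b in bytearray(data[len(ascii_part):]))
```

The three codecs disagree about which high-half byte means which
letter, which is exactly why the bytes alone are not enough to recover
the text.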

The most difficult language to encode in eight bits (of those where it
is still possible at all) is actually Vietnamese, which uses the Latin
alphabet, due to the sheer number of accented letters used. Windows'
encoding of it, Windows-1258 (along with some other lesser-used
encodings, all for Vietnamese), is the only 8-bit encoding to use
combining accents, in a way that is unfortunately incompatible with
Unicode normalization if naively translated: a letter such as
e-circumflex is stored precomposed while its tone mark is stored as a
separate combining character, so the decoded text can be neither NFC
nor NFD. VISCII, by contrast, avoids combining accents but sacrifices a
handful of C0 control characters in addition to fully packing the high
half with letters.
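
To make the normalization problem concrete, here is a minimal sketch
(Python 3 syntax, using the standard "cp1258" codec and the
unicodedata module; the helper name can_encode and the sample letter
are made up for this example). It takes one Vietnamese letter,
e with circumflex and dot below, and compares the form cp1258 can
store with the NFC and NFD forms.

```python
import unicodedata

# U+1EC7: LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
precomposed = u"\u1ec7"
# The same letter as cp1258 stores it: precomposed e-circumflex (U+00EA)
# followed by COMBINING DOT BELOW (U+0323).
cp1258_style = u"\u00ea\u0323"


def can_encode(s, codec="cp1258"):
    """Hypothetical helper: True if s encodes in codec without error."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


nfc = unicodedata.normalize("NFC", cp1258_style)
nfd = unicodedata.normalize("NFD", cp1258_style)

# All three spellings are canonically equivalent ...
assert nfc == precomposed
assert nfd == u"e\u0323\u0302"

# ... but the cp1258 spelling is neither of the normalized forms,
assert cp1258_style != nfc and cp1258_style != nfd

# and it is the only one of the three that the codec accepts.
assert can_encode(cp1258_style)
assert not can_encode(nfc)
assert not can_encode(nfd)

# The cp1258 spelling round-trips as two bytes.
raw = cp1258_style.encode("cp1258")
assert len(raw) == 2
assert raw.decode("cp1258") == cp1258_style
```

So a naive decode hands you text that is not normalized, and a naive
encode of normalized text fails outright; a converter has to re-split
or re-compose the accents itself on the way in and out.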


-- 
Random832


