Newbie question about text encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Feb 26 18:09:41 EST 2015


Chris Angelico wrote:

> Unicode
> isn't about taking everyone's separate character sets and numbering
> them all so we can reference characters from anywhere; if you wanted
> that, you'd be much better off with something that lets you specify a
> code page in 16 bits and a character in 8, which is roughly the same
> size as Unicode anyway.

Well, except for the approximately 25% of people in the world whose native
language has more than 256 characters.

It sounds like you are referring to some sort of "shift code" system. Some
legacy East Asian encodings use a similar scheme, and depending on how they
are implemented they have great disadvantages. For example, Shift-JIS
suffers from a number of weaknesses including that a single byte corrupted
in transmission can cause large swaths of the following text to be
corrupted. With Unicode, a single corrupted byte can only corrupt a single
code point.


-- 
Steven




More information about the Python-list mailing list