A few questions about encoding

Thu Jun 20 02:26:17 EDT 2013

On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:

> On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:
>  
>> Gah! That's twice I've screwed that up. Sorry about that!
> 
> Yeah, and your difficulty explaining the Unicode implementation reminds
> me of a passage from the Python zen:
> 
>  "If the implementation is hard to explain, it's a bad idea."

The *implementation* is easy to explain. It's the names of the encodings 
which I get tangled up in.

ASCII: Supports exactly 128 code points (0 through 127), each of which 
takes up exactly 7 bits. Each code point represents a character.
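
To make the 7-bit limit concrete, here is a small Python 3 sketch (the
variable names are only for illustration): every ASCII byte is below 128,
and a character outside that range refuses to encode.

```python
# ASCII covers code points 0 through 127, so every encoded byte fits in 7 bits.
ascii_bytes = "Hello".encode("ascii")
fits_in_7_bits = all(b < 128 for b in ascii_bytes)

# A character outside that range (here U+00E9) cannot be encoded as ASCII.
try:
    "caf\u00e9".encode("ascii")
    non_ascii_encodes = True
except UnicodeEncodeError:
    non_ascii_encodes = False

print(fits_in_7_bits, non_ascii_encodes)
```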

Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and 
about a gazillion other legacy charsets, all of which are mutually 
incompatible: each supports anything from 128 to 65536 different code 
points, usually 256 or fewer.
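
A rough illustration of "mutually incompatible", assuming Python 3 and four
codecs from its standard library: the same single byte decodes to a
different character under each charset.

```python
# One byte, four legacy charsets, four different characters.
raw = b"\xe9"
decoded = {
    name: raw.decode(name)
    for name in ("latin-1", "iso-8859-7", "cp1251", "mac_roman")
}

for name, char in decoded.items():
    print(name, repr(char))
```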

UCS-2: Supports exactly 65536 code points, each of which takes up exactly 
two bytes. That's fewer than Unicode requires, so it is obsoleted by:

UTF-16: Supports all 1114112 code points in the Unicode charset, using a 
variable-width system where the most popular characters use exactly two 
bytes and the remaining ones use a pair of two-byte code units (a 
surrogate pair), four bytes in total.
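
The surrogate-pair arithmetic can be sketched as follows. This is a minimal
illustration in Python 3, and to_surrogates is a hypothetical helper written
for this example, not a standard library function.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    offset = code_point - 0x10000          # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

bmp_char = "A"               # U+0041, inside the Basic Multilingual Plane
astral_char = "\U0001F600"   # U+1F600, outside it

# "utf-16-be" is used so that no byte order mark is prepended.
bmp_len = len(bmp_char.encode("utf-16-be"))        # one 2-byte code unit
astral_len = len(astral_char.encode("utf-16-be"))  # two 2-byte code units

high, low = to_surrogates(0x1F600)
print(bmp_len, astral_len, hex(high), hex(low))
```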

UCS-4: Has room for up to 4294967296 code points, each of which takes up 
exactly four bytes. That is more than needed for the Unicode charset, so 
this is obsoleted by:

UTF-32: Supports all 1114112 code points, using exactly four bytes each. 
Code points outside of the range 0 through 1114111 inclusive are an error.
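
As a quick check in Python 3 (a sketch, with illustrative variable names):
every character costs four bytes in UTF-32, and asking for a code point
past 0x10FFFF (1114111) is rejected.

```python
# Three characters of very different "size", all four bytes wide in UTF-32.
# "utf-32-be" is used so that no byte order mark is prepended.
sample = "A\u20ac\U0001F600"
widths = [len(ch.encode("utf-32-be")) for ch in sample]

# 0x110000 is one past the last valid code point, so chr() refuses it.
try:
    chr(0x110000)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False

print(widths, out_of_range_ok)
```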

UTF-8: Supports all 1114112 code points, using a variable-width system 
where ASCII characters require 1 byte, and others use 2, 3 or 4 bytes as 
needed.
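
For example, in Python 3 (the sample characters are chosen only to hit each
width):

```python
# One sample character for each UTF-8 width.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]   # A, e-acute, euro sign, emoji
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(utf8_lengths)   # [1, 2, 3, 4]
```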

Ignoring the legacy charsets, only UTF-16 has an implementation that is 
at all complicated, due to the surrogate pairs. But even that is not too 
bad. The real complication comes from the interactions between systems 
which use different encodings, and that has nothing to do with Unicode.
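
A small sketch of that kind of interaction, assuming Python 3 and a
hypothetical sender that writes UTF-8 while the receiver assumes Latin-1:
both encodings behave exactly as specified, yet the text comes out garbled.

```python
original = "caf\u00e9"                    # four characters, ending in e-acute
wire_bytes = original.encode("utf-8")     # sender writes UTF-8: 5 bytes

misread = wire_bytes.decode("latin-1")    # receiver wrongly assumes Latin-1
correct = wire_bytes.decode("utf-8")      # receiver uses the right encoding

print(repr(misread), repr(correct))
```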

-- 
Steven