A few questions about encoding

Thu Jun 20 07:43:28 EDT 2013

On 20/06/2013 07:26, Steven D'Aprano wrote:
> On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:
>
>> On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:
>>
>>> Gah! That's twice I've screwed that up. Sorry about that!
>>
>> Yeah, and your difficulty explaining the Unicode implementation reminds
>> me of a passage from the Python zen:
>>
>>  "If the implementation is hard to explain, it's a bad idea."
>
> The *implementation* is easy to explain. It's the names of the encodings
> which I get tangled up in.
>
You're off by one below!
>
> ASCII: Supports exactly 127 code points, each of which takes up exactly 7
> bits. Each code point represents a character.
>
128 code points (0 through 127).
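
If you'd rather check than take my word for it, here is a rough Python 3 sketch that asks the built-in 'ascii' codec which code points it will encode. The helper name ascii_code_points is just something I made up for illustration.

```python
def ascii_code_points():
    """Return every code point in range(256) that the 'ascii' codec encodes."""
    accepted = []
    for cp in range(256):
        try:
            chr(cp).encode("ascii")
        except UnicodeEncodeError:
            # Anything above 127 is rejected by the codec.
            continue
        accepted.append(cp)
    return accepted


points = ascii_code_points()
print(len(points), points[0], points[-1])  # 128 0 127
```

The count is 128 because 0 is a code point too.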

> Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and
> about a gazillion other legacy charsets, all of which are mutually
> incompatible: supports anything from 127 to 65535 different code points,
> usually under 256.
>
128 to 65536 code points.

> UCS-2: Supports exactly 65535 code points, each of which takes up exactly
> two bytes. That's fewer than required, so it is obsoleted by:
>
65536 code points (0 through 65535).

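etc. The same off-by-one applies to the Unicode totals below: the range
0 through 1114111 (0x10FFFF) inclusive holds 1114112 code points.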
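
The arithmetic behind all of these is the same: an n-bit field holds 2**n values, numbered 0 through 2**n - 1. A small Python 3 sketch, assuming nothing beyond the built-in chr(); the names count_for_bits, is_valid_code_point and MAX_CODE_POINT are mine, not anything from the standard library.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point, 1114111 in decimal


def count_for_bits(bits):
    """Number of distinct values an unsigned field of `bits` bits can hold."""
    return 2 ** bits


def is_valid_code_point(cp):
    """True if chr() accepts cp, i.e. 0 <= cp <= 0x10FFFF."""
    try:
        chr(cp)
    except ValueError:
        return False
    return True


ascii_count = count_for_bits(7)      # 128, numbered 0..127
ucs2_count = count_for_bits(16)      # 65536, numbered 0..65535
unicode_count = MAX_CODE_POINT + 1   # 1114112, numbered 0..1114111

print(ascii_count, ucs2_count, unicode_count)
print(is_valid_code_point(MAX_CODE_POINT), is_valid_code_point(MAX_CODE_POINT + 1))
```

The last line should report that 0x10FFFF is accepted and 0x110000 is not, which is the "outside of the range is an error" rule in practice.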

> UTF-16: Supports all 1114111 code points in the Unicode charset, using a
> variable-width system where the most popular characters use exactly two-
> bytes and the remaining ones use a pair of characters.
>
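Strictly speaking the "pair of characters" is a pair of 16-bit code units, the surrogates. For anyone who wants to see the mechanics, here is a minimal Python 3 sketch of how a code point above 0xFFFF is split into a high and a low surrogate, cross-checked against the built-in 'utf-16-be' codec. The function name to_surrogate_pair is my own, and U+1F600 is just an arbitrary sample.

```python
def to_surrogate_pair(cp):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)

# Cross-check against what the codec actually emits (big-endian, no BOM).
encoded = chr(0x1F600).encode("utf-16-be")
from_codec = (
    int.from_bytes(encoded[:2], "big"),
    int.from_bytes(encoded[2:], "big"),
)

print(hex(high), hex(low), from_codec == (high, low))  # 0xd83d 0xde00 True
```
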
> UCS-4: Supports exactly 4294967295 code points, each of which takes up
> exactly four bytes. That is more than needed for the Unicode charset, so
> this is obsoleted by:
>
> UTF-32: Supports all 1114111 code points, using exactly four bytes each.
> Code points outside of the range 0 through 1114111 inclusive are an error.
>
> UTF-8: Supports all 1114111 code points, using a variable-width system
> where popular ASCII characters require 1 byte, and others use 2, 3 or 4
> bytes as needed.
>
>
> Ignoring the legacy charsets, only UTF-16 is a terribly complicated
> implementation, due to the surrogate pairs. But even that is not too bad.
> The real complication comes from the interactions between systems which
> use different encodings, and that's nothing to do with Unicode.
>
>
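
To put some numbers on the fixed-width versus variable-width descriptions above, here is a quick Python 3 sketch that encodes a few sample characters and reports how many bytes each takes. The sample characters and the helper name encoded_widths are my own choices; I've used the "-le" codec variants so that no byte order mark is added to the counts.

```python
CODECS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_widths(ch):
    """Return the encoded length in bytes of ch for each codec in CODECS."""
    return tuple(len(ch.encode(name)) for name in CODECS)


# An ASCII letter, a Latin-1 letter, the euro sign, and a non-BMP character.
samples = ("A", "\u00e9", "\u20ac", "\U0001F600")

for ch in samples:
    print("U+%04X" % ord(ch), encoded_widths(ch))

# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+20AC (3, 2, 4)
# U+1F600 (4, 4, 4)
```

UTF-32 is always four bytes, UTF-16 is two bytes until you leave the Basic Multilingual Plane, and UTF-8 grows from one to four bytes as the code point gets larger.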