Unicode utf-8 doesn't do back-and-forth?

Tim Peters tim_one at email.msn.com
Wed Jul 10 01:22:09 EDT 2002


[John Machin]
> 4 more bits? It needs 21 bits to encode the 2**20 possible
> surrogate-described characters plus the basic 64K characters.
>     assert 21 - 16 == 5

[Martin v. Loewis]
> Not really. This makes a total of 2**20+2**16 = 1114112
> characters. Now, math.log(1114112)/math.log(2) is 20.087462841250343,
> so it is rather 4.09 additional bits.

[John]
> (1) Shouldn't you deduct the 2048 surrogates from the count?

If Martin were feeling anal about this, he would have asked why you said
"characters" instead of "code points".  Since he didn't, I have to assume
he's trying to be informative instead <wink>.
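
For anyone who does want to deduct the surrogates, a quick check suggests it barely moves the figure. This is only a sketch, assuming a modern Python 3 (math.log2 did not exist when this thread was written); the variable names are made up for illustration.

```python
import math

# All code points U+0000..U+10FFFF: 2**20 + 2**16 = 1114112.
total_code_points = 2**20 + 2**16

# The surrogate range U+D800..U+DFFF holds 2048 code points.
surrogates = 0xDFFF - 0xD800 + 1

# What is left once the surrogates are deducted.
scalar_values = total_code_points - surrogates

# Information content in bits, with and without the surrogates.
bits_all = math.log2(total_code_points)
bits_without_surrogates = math.log2(scalar_values)

print(bits_all, bits_without_surrogates)
```

Both values sit between 20.08 and 20.09, so a fixed-width encoding needs 21 whole bits either way.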

> (2) Why did you round up to two decimal places and not zero decimal
> places? Can you buy 4.09 cans of beer?

I can't, but if there are 1114112 possible "beer points", and you use a full
21 bits to *encode* each possibility, you're just wasting precious storage
<wink>.  For example, the straightforward Huffman beer encoding, assuming
equal probabilities, uses less than 20.12 bits per beer point on average
(2**20-2**16 points are assigned 20-bit codes, and the remaining 2*2**16
each get a 21-bit code).
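
The arithmetic behind that figure can be checked with a few lines. This is a sketch under stated assumptions, not anything from the original thread: it assumes Python 3 (true division, math.log2), and the helper name equal_probability_code_lengths is hypothetical. It relies on the standard result that a Huffman code over n equally likely symbols uses only the two lengths floor(log2(n)) and floor(log2(n)) + 1.

```python
import math

N = 2**20 + 2**16  # 1114112 equally likely "beer points"


def equal_probability_code_lengths(n):
    """Return (k, n_short, n_long) for a Huffman code over n equiprobable
    symbols: n_short symbols get k-bit codes, n_long get (k + 1)-bit codes."""
    k = n.bit_length() - 1        # floor(log2(n))
    n_long = 2 * (n - 2**k)       # symbols pushed down one extra level
    n_short = n - n_long          # symbols that keep a k-bit code
    return k, n_short, n_long


k, n_short, n_long = equal_probability_code_lengths(N)

# Average code length, weighting each length by how many symbols use it.
average_bits = (n_short * k + n_long * (k + 1)) / N

# The entropy lower bound for N equally likely outcomes.
entropy_bits = math.log2(N)

print(k, n_short, n_long)
print(average_bits, entropy_bits)
```

The split comes out as 2**20-2**16 codes of 20 bits and 2*2**16 codes of 21 bits, for an average of 20 + 2/17 bits (about 20.1176): above the 20.087-bit entropy, but well below a flat 21.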

a-fractional-bit-saved-is-a-fractional-bit-earned-ly y'rs  - tim





